APPEAL BRIEF

Sir: 

Appellants hereby submit an original and two copies of this Appeal Brief to the Board of Patent 
Appeals and Interferences ("the Board") in response to the Final Office Action mailed on 
November 4, 2002. The Notice of Appeal was timely submitted on March 4, 2003, and was received in 
the Patent and Trademark Office ("the Office") on March 11, 2003. This Appeal Brief is timely submitted
in light of the concurrently filed Petition for an Extension of Time of one month to and including 
June 11, 2003, and authorization to deduct the fee as required under 37 C.F.R. § 1.17(a)(1) from 
Appellants' Representatives' deposit account. The Commissioner is also authorized to charge the fee for 
filing this Appeal Brief ($160.00), as required under 37 C.F.R. § 1.17(c), to Lexicon Genetics 
Incorporated Deposit Account No. 50-0892. 

Appellants believe no fees in addition to the fee for filing the Appeal Brief and the fee for the 
extension of time are due in connection with this Appeal Brief. However, should any additional fees under 
37 C.F.R. §§ 1.16 to 1.21 be required for any reason related to this communication, the Commissioner
is authorized to charge any underpayment or credit any overpayment to Lexicon Genetics Incorporated 
Deposit Account No. 50-0892. 

I. REAL PARTY IN INTEREST 

The real party in interest is the Assignee, Lexicon Genetics Incorporated, 8800 Technology Forest
Place, The Woodlands, Texas, 77381.

II. RELATED APPEALS AND INTERFERENCES 

Appellants know of no related appeals or interferences that will directly affect or be directly 
affected by or have a bearing on the Board's decision in the pending appeal. 
III. STATUS OF THE CLAIMS

The present application was filed on October 11, 2001, claiming the benefit of U.S. Provisional
Application Number 60/239,592, which was filed on October 11, 2000, and included original claims 1
and 2. A Restriction and Election Requirement was issued on May 7, 2002, separating the original claims
into sixteen separate and distinct inventions. In a response to the Restriction and Election Requirement
submitted to the Office on June 3, 2002, Appellants elected with traverse the claim of the Group IV
invention (original claim 1 (in part)) for prosecution on the merits, argued that the Group XII invention
(original claim 2 (in part)) should be properly rejoined with the Group IV invention, and amended claims
1 and 2 to remove reference to the Group I-III, V-XI and XIII-XVI inventions.

A First Official Action on the merits ("the First Action") was issued on July 8, 2002, in which the 
Examiner agreed that the Group XII invention should be rejoined with the Group IV invention, claims 1 
and 2 were rejected under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and claims 1 and 2
were rejected under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to
the alleged lack of patentable utility. In a response to the First Official Action submitted to the Office on
October 7, 2002 ("Response to the First Action"), Appellants addressed the rejections of claims 1 and 2,
and added new claim 3.

A Second and Final Official Action ("the Final Action") was mailed on November 4, 2002, 
maintaining the rejection of claims 1-3 under 35 U.S.C. § 101 as allegedly lacking a patentable utility, and 
under 35 U.S.C. § 112, first paragraph, as allegedly unusable by the skilled artisan due to the alleged lack
of patentable utility. In a response to the Second and Final Office Action submitted on March 4, 2003
("Response to the Final Action"), Appellants again addressed the rejections of claims 1-3. An Advisory
Action ("the Advisory Action") was mailed on April 15, 2003, maintaining the rejection of claims 1-3 under
35 U.S.C. § 101 as allegedly lacking a patentable utility, and under 35 U.S.C. § 112, first paragraph, as
allegedly unusable by the skilled artisan due to the alleged lack of patentable utility. Therefore, claims 1-3
are the subject of this appeal. A copy of the appealed claims is included below in the Appendix
(Section IX). 




IV. STATUS OF THE AMENDMENTS 

As no amendments subsequent to the Final Action have been filed, Appellants believe that no
outstanding amendments exist.

V. SUMMARY OF THE INVENTION 

The present invention relates to Appellants' discovery and identification of novel human 
polynucleotide sequences that encode novel seven transmembrane protein receptor proteins, specifically 
G-protein coupled receptors (GPCRs) (specification at page 2, lines 8-11). GPCRs have been associated
with transduction pathways involving G-proteins or PPG proteins (specification at page 2, lines 2-3). 

The presently claimed polynucleotide sequences were compiled from human genomic sequences 
in conjunction with cDNAs generated from human skeletal muscle, testis, and spleen mRNAs (specification 
at page 5, lines 28-30). 

The specification details a number of uses for the presently claimed polynucleotide sequences, 
including in diagnostic assays such as forensic analysis (see, for example, the specification at page 4, line
31, page 36, lines 30-31, and page 38, lines 4-5), in assessing gene expression patterns, particularly using
a high throughput "chip" format (see, for example, the specification at page 10, lines 13-16), and in 
mapping the sequences to a specific region of a human chromosome (see, for example, the specification 
at page 4, lines 24-26). 

VI. ISSUES ON APPEAL 

1. Do claims 1-3 lack a patentable utility? 

2. Are claims 1-3 unusable by a skilled artisan due to a lack of patentable utility? 

VII. GROUPING OF THE CLAIMS 

For the purposes of the outstanding rejections under 35 U.S.C. § 101 and 35 U.S.C. § 112, first
paragraph, the claims will stand or fall together. 
VIII. ARGUMENT

A. Do Claims 1-3 Lack a Patentable Utility? 

The Final Action first rejects claims 1-3 under 35 U.S.C. § 101, as allegedly lacking a patentable
utility due to not being supported by either a specific and substantial or a well-established utility. 

Appellants pointed out in both the Response to the First Action and the Response to the Final 
Action that as just one example of the utility of the presently claimed sequences, the present nucleic acid 
sequences have utility in forensic analysis, as described in the specification as originally filed (see, for 
example, page 4, line 31, page 36, lines 30-31, and page 38, lines 4-5). As described in the specification
at page 8, lines 25-28, the present sequences define a coding single nucleotide polymorphism - specifically,
a G/A polymorphism at position 146 of SEQ ID NO:8, which can lead to a serine or asparagine residue
at amino acid position 49 of SEQ ID NO:9. As such polymorphisms are the basis for forensic analysis,
which is undoubtedly a "real world" utility, the presently claimed sequences must in themselves be useful.

The Advisory Action states that the use of the presently described polymorphism in forensic
analysis would require "further research", because "the instant disclosure fails to disclose the population
that polymorphic marker distinguishes" (the Advisory Action at page 2). Appellants submit that the
presently described polymorphism is useful in forensic analysis exactly as it was described in the
specification as originally filed. Individual members of a population can be distinguished based on the
presence or absence of the described polymorphism, and thus, these sequences are useful without
"additional research". Simply because the use of this polymorphic marker will necessarily provide
additional information on the percentage of [illegible]
does not mean that "additional research" is needed in order for this marker as it is presently described in
the instant specification to be of use to forensic science. Thus, the Examiner's position does not support
the alleged lack of utility.

This is also not a case of a potential utility. As stated above, using the presently described
polymorphic marker as described in the specification as originally filed will definitely distinguish members
of a population from one another. In the worst case scenario, this marker is useful to distinguish 50% of
the population (in other words, the [illegible]
50% of the population from a forensic analysis clearly is a real world, practical utility. In the Advisory
Action, the Examiner states that the use of the presently described polymorphic marker in forensic analysis 
is not "specific and substantial" (the Advisory Action at page 2). Appellants are completely at a loss to 
understand how, given the widespread and daily use of forensic analysis to distinguish individuals, the use 
of a polymorphic marker in forensic analysis is not a "substantial" use. With regard to the allegation that 
the use of the presently described polymorphism in forensic analysis is not a "specific" use, as set forth in 
the Response to the Final Action, Appellants submit that this is improper on a number of different grounds. 
First, and most importantly, the Final Action seems to be confusing the requirements of a specific utility 
with a unique utility. The fact that other polymorphic markers have been identified in other genetic loci 
does not mean that Appellants' [illegible]
As clearly stated by the Federal Circuit in Carl Zeiss Stiftung v. Renishaw PLC, 20 USPQ2d 1101 (Fed.
Cir. 1991): 

An invention need not be the best or only way to accomplish a certain result, and it need 
only be useful to some extent and in certain applications: "[T]he fact that an invention has 
only limited utility and is only operable in certain applications is not grounds for finding a 
lack of utility." Envirotech Corp. v. Al George, Inc., 221 USPQ 473, 480 (Fed. Cir.
1984) 

Just because other polymorphic sequences from the human genome have been described does not mean 
that the use of the presently described polymorphic markers for forensic analysis is not a specific utility. 
The requirement for a specific utility, which is the proper standard for utility under 35 U.S.C. § 101, should
not be confused with the requirement for a unique utility, which is clearly an improper standard. If every
invention were required to have a unique utility, the Patent and Trademark Office would no longer be
issuing patents on batteries, automobile tires, golf balls, golf clubs, and treatments for a variety of human
diseases, just to name a few particular examples, because examples of each of these have already been
described and patented. However, only the briefest perusal of any issue of the Official Gazette provides
numerous examples of patents being granted on each of the above compositions every week. Furthermore,
if a composition needed to be unique to be patented, the entire class and subclass system would be an
effort in futility, as the class and subclass system serves solely to group such common inventions, which
would not be required if each invention needed to have a unique utility. Thus, the present sequence clearly
meets the requirements of 35 U.S.C. § 101.

Second, Appellants submit that the asserted forensic utility is specific because it cannot be applied
to just any nucleic acid. In fact, the basis for forensic analysis is the fact that such a polymorphic marker
is not present in all other nucleic acids, but is in fact specific and unique to only a certain subset of the
population. As such, the presently described polymorphic marker clearly has a specific utility, and
therefore the presently claimed invention must meet the requirements for utility under 35 U.S.C. § 101.

It is important to note that it has been clearly established that a statement of utility in a specification 
must be accepted absent reasons why one skilled in the art would have reason to doubt the objective truth
of such statement. In re Langer, 503 F.2d 1380, 1391, 183 USPQ 288, 297 (CCPA, 1974; "Langer");
In re Marzocchi, 439 F.2d 220, 224, 169 USPQ 367, 370 (CCPA, 1971). As clearly set forth in
Langer:

As a matter of Patent Office practice, a specification which contains a disclosure of utility 
which corresponds in scope to the subject matter sought to be patented must be taken as 
sufficient to satisfy the utility requirement of § 101 for the entire claimed subject matter 
unless there is a reason for one skilled in the art to question the objective truth of the 
statement of utility or its scope. 

Langer at 297, emphasis in original. As set forth in the MPEP, "Office personnel must provide evidence
sufficient to show that the statement of asserted utility would be considered 'false' by a person of ordinary
skill in the art" (MPEP, Eighth Edition at 2100-40, emphasis added). Thus, the present claims clearly meet
the requirements of 35 U.S.C. § 101.

Furthermore, as the presently described polymorphism is a part of the family of polymorphisms that 
have a well established utility, the Federal Circuit's holding in In re Brana (34 USPQ2d 1436 (Fed. Cir.
1995), "Brana") is directly on point. In Brana, the Federal Circuit admonished the Patent and Trademark
Office for confusing "the requirements under the law for obtaining a patent with the requirements for
obtaining government approval to market a particular drug for human consumption". Brana at 1442. The
Federal Circuit went on to state:

At issue in this case is an important question of the legal constraints on patent office 
examination practice and policy. The question is, with regard to pharmaceutical inventions, 
what must the applicant provide regarding the practical utility or usefulness of the invention 
for which patent protection is sought. This is not a new issue; it is one which we would
have thought had been settled by case law years ago.

Brana at 1439, emphasis added. The choice of the phrase "utility or usefulness" in the foregoing quotation
is highly pertinent. The Federal Circuit is evidently using "utility" to refer to rejections under
35 U.S.C. § 101, and is using "usefulness" to refer to rejections under 35 U.S.C. § 112, first paragraph.
This is made evident in the continuing text in Brana, which explains the correlation between 35 U.S.C.
§§ 101 and 112, first paragraph. The Federal Circuit concluded:

FDA approval, however, is not a prerequisite for finding a compound useful within the
meaning of the patent laws. Usefulness in patent law, and in particular in the context of
pharmaceutical inventions, necessarily includes the expectation of further research and
development. The stage at which an invention in this field becomes useful is well before
it is ready to be administered to humans. Were we to require Phase II testing in order to
prove utility, the associated costs would prevent many companies from obtaining patent
protection on promising new inventions, thereby eliminating an incentive to pursue, through
research and development, potential cures in many crucial areas such as the treatment of
cancer.



Brana at 1442-1443, citations omitted, emphasis added. As set forth above, the present polymorphisms
are useful in forensic analysis exactly as they are described in the specification as originally filed, without
the need for any further research. Even if the use of these polymorphic markers provided additional
information on the percentage of [illegible]
not mean that "additional research" is needed in order for this marker as it is presently described in the
instant specification to be of use to forensic science. As stated above, using the polymorphic marker as
described in the specification as originally filed will definitely distinguish members of a population from one
another. However, even if, arguendo, further research might be required in certain aspects of the present
invention, this does not preclude a finding that the invention has utility, as set forth by the Federal Circuit's
holding in Brana, which clearly states, as highlighted in the quote above, that "pharmaceutical inventions,
necessarily includes the expectation of further research and development" (Brana at 1442-1443, emphasis
added). In assessing the question of whether undue experimentation would be required in order to practice
the claimed invention, the key term is "undue", not "experimentation". In re Angstadt and Griffin, 190
USPQ 214 (CCPA 1976). The need for some experimentation does not render the claimed invention
unpatentable. Indeed, a considerable amount of experimentation may be permissible if such
experimentation is routinely practiced in the art. In re Angstadt and Griffin, supra; Amgen, Inc. v. 
Chugai Pharmaceutical Co., Ltd., 18 USPQ2d 1016 (Fed. Cir. 1991). As a matter of law, it is well 
settled that a patent need not disclose what is well known in the art. In re Wands, 8 USPQ2d 1400 (Fed.
Cir. 1988). 

Although Appellants need only make one credible assertion of utility to meet the requirements of 
35 U.S.C. § 101 (Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); In re Gottlieb, 140 USPQ 665
(CCPA 1964); In re Malachowski, 189 USPQ 432 (CCPA 1976); Hoffman v. Klaus, 9 USPQ2d 1657
(Bd. Pat. App. & Inter. 1988)), Appellants pointed out in both the Response to the First Action and the
Response to the Final Action that a sequence sharing over 99% identity at the amino acid level over
the entire length of the described sequence is present in the leading scientific repository for biological 
sequence data (GenBank), and has been annotated by third party scientists wholly unaffiliated with 
Appellants as "Homo sapiens gene for seven transmembrane helix receptor" (GenBank accession number 
AB065623; alignment provided in Exhibit A). The Final Action stated that "there is no sufficient and 
credible information that indicates the published sequence is a functional GPCR" (the Final Action at page 
3). Appellants pointed out in the Response to the Final Action that an additional sequence sharing over 
99% identity at the amino acid level over the entire length of the described sequence is present in
the leading scientific repository for biological sequence data (GenBank), and has been annotated by 
different third party scientists wholly unaffiliated with Appellants as a "G-protein coupled receptor" 
(GenBank accession number BD144530; alignment provided in Exhibit B). The legal test for utility simply 
involves an assessment of whether those skilled in the art would find any of the utilities described for the 
invention to be credible or believable. Given these two GenBank annotations, there can be no question 
that those skilled in the art would clearly believe that Appellants' sequence is a G-protein coupled receptor. 
Thus, the present sequence clearly meets the requirements of 35 U.S.C. § 101. 

The First Action denied that the extensive homology between Appellants' sequence and those
sequences presented above confers a patentable utility to Appellants' sequence, by questioning prediction
of protein function based upon protein homology. In support of this allegation, the First Action cited Bork
and Koonin (1998, Nature Genetics 18:313-318; "Bork and Koonin"), Ji et al. (1998, J. Biol. Chem.
273:17299-17302; "Ji") and Yan et al. (2000, Science 290:523-527; "Yan"). While these arguments
were not set forth in either the Final Action or the Advisory Action, Appellants will again set forth the
shortcomings of these articles, and point out the failure of these articles to support the alleged lack of utility
of the presently claimed sequence.

First, with regard to the Bork and Koonin article, Bork and Koonin themselves conclude "(i)n
summary, the currently available methods for sequence analysis are sophisticated, and while further
improvements will certainly ensue, they are already capable of extracting subtle but functionally relevant
signals from protein sequences" (Bork and Koonin, page 317). Thus, the Bork and Koonin article is hardly
indicative of a high level of uncertainty in assigning function based on sequence, and thus does not support
the alleged lack of utility.

With regard to Ji, an exact quote from Ji completely undermines the question of asserted utility
based upon protein homology: "a substantial degree of amino acid homology is found between members
of a particular subfamily, but comparisons between [illegible]
at 17299, first paragraph, emphasis added). This quote suggests that homology with members of a
G-protein coupled receptor subfamily is indicative that the particular sequence is in fact a member of that subfamily -
the fact that there is little or no homology between subfamilies is completely irrelevant. Thus, Ji does not
support the alleged lack of utility.

Furthermore, regarding Yan, this paper cites only one example, two isoforms of the anhidrotic
ectodermal dysplasia (EDA) gene, where a two amino acid change converts one isoform (EDA-A1) into
the second isoform (EDA-A2). However, while it is true that this amino acid change results in binding to
different receptors, it is important [illegible]
related (Yan at page 523). Furthermore, the EDA-A2 receptor was correctly identified as a member of
the tumor necrosis factor receptor superfamily based solely on sequence similarity (Yan at page 523).
Thus, Yan does not suggest a high level of uncertainty in assigning function based on sequence, and thus
also does not support the alleged lack of utility.

Rather, with regard to the utility of the presently claimed sequence, as 60% of the pharmaceutical
products currently being marketed by the entire industry target G-protein coupled receptors (Gurrath, 2001,
Curr. Med. Chem. 8:1605-1648; abstract presented in Exhibit C), a preponderance of the evidence
clearly weighs in favor of Appellants' assertion that the skilled artisan would readily recognize that the 
presently described sequences have a specific (the claimed GPCR proteins are encoded by a specific locus 
on the human genome, see below), credible, and well-established utility, for example in tracking gene 
expression, as described in the specification as originally filed, at least at page 10, lines 13-16. In 
particular, the specification describes how the described sequences can be represented using a gene chip 
format to provide a high throughput analysis of the level of gene expression. Such "DNA chips" clearly 
have utility, as evidenced by hundreds of issued U.S. Patents, as exemplified by U.S. Patent Nos. 
5,445,934 (Exhibit D), 5,556,752 (Exhibit E), 5,744,305 (Exhibit F), 5,837,832 (Exhibit G), 
6,156,501 (Exhibit H) and 6,261,776 (Exhibit I). Evidence of the "real world" substantial utility of the
present invention is further [illegible]

of gene sequences or fragments thereof in a gene chip format. Perhaps the most notable gene chip 
company is Affymetrix. However, there are many companies that have, at one time or another, 
concentrated on the use of gene sequences or fragments, in gene chip and non-gene chip formats, for 
example: Gene Logic, ABI-Perkin-Elmer, HySeq and Incyte. In addition, one such company (Rosetta 
Inpharmatics) was viewed to have such "real world" value that it was acquired by a large pharmaceutical
company (Merck) for significant sums of money (net equity value of the transaction was $620 million). The 
"real world" substantial industrial utility of gene sequences or fragments would, therefore, appear to be 
widespread and well established. Clearly, there can be no doubt that the skilled artisan would know how 
to use the presently claimed sequences (see Section VIII(B), below), strongly arguing that the claimed
sequences have utility. Given the widespread utility of such "gene chip" [illegible]
sequence information, there can be little doubt that the use of the presently described novel sequences
would have great utility in such DNA chip applications. As the present sequences are specific markers of
the human genome (see below), and such specific markers are targets for the discovery of drugs that are
associated with human disease, those of skill in the art would instantly recognize that the present nucleotide
sequences would be ideal, novel candidates for assessing gene expression using such DNA chips. Clearly,
compositions that enhance the utility of such DNA chips, such as the presently claimed nucleotide
sequences, must in themselves be useful. Thus, the present claims clearly meet the requirements of 
35 U.S.C. § 101. 

The Final Action questioned this utility, stating "(s)ince the disclosure does not reveal any
activity/functions of the nucleotide sequence or the protein encoded by the nucleotide sequence, one skilled
in the art would not know how to use the claimed invention" (the Final Action at page 5). However, this
argument is thwarted by the fact that skilled artisans already have used and continue to use sequences such
as Appellants' in gene chip applications. Appellants respectfully point out that this is exactly how most gene
chip applications are carried out. Expression profiling does not require a [illegible]
particular nucleic acid on the chip - rather the gene chip indicates which DNA fragments are expressed at 
greater or lesser levels in two or more particular tissue types. Therefore, this argument also fails to support 
the alleged lack of utility of the presently claimed compositions. 

Clearly, persons of skill in the art, as well as venture capitalists and investors, readily recognize the 
utility, both scientific and commercial, of genomic data in general, and specifically human genomic data. 
Billions of dollars have been invested in the human genome project, resulting in useful genomic data (see, 
e.g., Venter et al., 2001, Science 291:1304; Exhibit J). The results have been a stunning success as the
utility of human genomic data has [illegible]
Kennedy, 2001, Science 291:1153; Exhibit K). Clearly, the usefulness of human genomic data, such as
the presently claimed nucleic [illegible]
the creation of numerous [illegible]
genomic information has been clearly understood for many years).

As yet a further example of the utility [illegible]
specification at least at [illegible]
the sequences to a specific region of a human chromosome. This is evidenced by the fact that SEQ ID
NO:8 can be used to map the presently claimed sequence to chromosome 1 (present within the
chromosome 1 clone, Genbank Accession Number AC091612; alignment and the first page from the
Genbank report are presented in Exhibit L). Clearly, the present polynucleotide provides exquisite
specificity in [illegible]
polynucleotide [illegible]

that makes this particular sequence so useful. Early gene mapping techniques relied on methods such as 
Giemsa staining to identify regions of chromosomes. However, such techniques produced genetic maps 
with a resolution of only 5 to 10 megabases, far too low to be of much help in identifying specific genes 
involved in disease. The [illegible]
map a specific locus of the human genome, such as the present nucleic acid sequence. For further evidence
in support of the Appellants' position, the Board is requested to review, for example, section 3 of Venter
et al. (supra at pp. [illegible]
significance of expressed sequence information in the structural analysis of genomic data. The presently
claimed polynucleotide sequence defines a biologically validated sequence that provides a unique and
specific resource for mapping the genome essentially as described in the Venter et al. article. Thus, the
present claims clearly meet the requirements of 35 U.S.C. § 101. 

Appellants respectfully remind the Board that only a minor percentage of the genome (2-4%)
actually encodes exons [illegible]
sequence provides biologically [illegible]
polyadenylated) that [illegible] define that portion of the corresponding genomic locus that
encodes exon sequence. Appellants respectfully submit that the practical scientific value of expressed and
[illegible] mRNA [illegible]
biochemical arts. Thus, the present claims clearly meet the requirements of 35 U.S.C. § 101.

Regarding the utility requirements under 35 U.S.C. § 101, the Federal Circuit has clearly stated
"(t)he threshold of utility is not high: An invention is 'useful' under section 101 if it is capable of providing
some identifiable benefit." Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 51 USPQ2d 1700
(Fed. Cir. 1999) (citing Brenner v. Manson, 383 U.S. 519, 534 (1966)). Additionally, the Federal Circuit
has stated that [quotation illegible]
Brooktree Corp. v. Advanced Micro Devices, Inc., 977 F.2d 1555, 1571, 24 USPQ2d 1401 (Fed. Cir.
1992), emphasis added. Cross v. Iizuka (753 F.2d 1040, 224 USPQ 739 (Fed. Cir. 1985); "Cross")
states "any utility of the claimed compounds is sufficient to satisfy 35 U.S.C. § 101". Cross at 748,
emphasis added. Indeed, the Federal Circuit recently emphatically confirmed that "anything under the sun
that is made by man" is patentable (State Street Bank & Trust Co. v. Signature Financial Group Inc.,
149 F.3d 1368, 47 USPQ2d 1596, 1600 (Fed. Cir. 1998), citing the U.S. Supreme Court's decision in
Diamond vs. Chakrabarty, 447 U.S. 303, 206 USPQ 193 (U.S., 1980)). Thus, based on the relevant
case law, the present claims clearly meet the requirements of 35 U.S.C. § 101. 

The Final Action also questioned the applicability of this case law, stating that "the Response cites 
a device case law" and "(t)hus, applicants' argument citing a case law regarding a device [illegible]
the instant case" (the Final Action at page 3). Section 101 of the Patent Act of 1952, 35 U.S.C. § 101,
provides that "[w]hoever invents or discovers any new and useful process, machine, manufacture, or
composition of matter, or [illegible]
or discovery. Appellants point out that 35 U.S.C. § 101 covers devices (machines) as well as
compositions, and makes no distinction between the two with regard to meeting the burden of complying
with 35 U.S.C. § 101. Furthermore, the case law in question (Juicy Whip Inc. v. Orange Bang Inc.,
51 USPQ2d 1700 (Fed. Cir. 1999)) cites Brenner v. Manson, 383 U.S. 519, 534 (1966), which the
Examiner obviously believes is not "irrelevant to the instant case" since [illegible]
case two times in the Final Action (see the Final Action at pages 3 and 5). Additionally, Cross and
Diamond vs. Chakrabarty, supra, do not concern devices, but rather compositions. Thus, this argument
completely fails to support the alleged lack of utility of the presently claimed compositions. 

Finally, while Appellants are well aware of the new Utility Guidelines set forth by the USPTO,
Appellants respectfully point out that the current rules and regulations regarding the examination of patent
applications are and always have been the patent laws as set forth in 35 U.S.C. and the patent rules as set
forth in 37 C.F.R., not the Manual of Patent Examining Procedure or particular guidelines for patent
examination set forth by the USPTO. Furthermore, it is the job of the judiciary, not the USPTO, to
interpret these laws and rules. Appellants are unaware of any significant recent changes in either
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal [illegible]
that is in keeping with [illegible]
patents that have been issued over the years that claim nucleic acid fragments that do not comply with the
new Utility Guidelines. As examples of such issued U.S. Patents, the Board is invited to review U.S. Patent 
Nos. 5,817,479 (Exhibit M), 5,654,173 (Exhibit N), and 5,552,281 (Exhibit O; each of which claims
short polynucleotides), and recently issued U.S. Patent No. 6,340,583 (Exhibit P; which includes no
working examples), none of which contain examples of the "real world" utilities that the Examiner seems
to be requiring. Additionally, the Office has recently issued U.S. Patent 6,043,052 (Exhibit Q), which
concerns an "orphan" G-Protein coupled receptor identified based only on homology to the orphan 
receptor GPR25, similar to the situation with Appellants' currently claimed sequence. Importantly, this 
issued patent also contains no examples of the "real world" utilities seemingly required in the present case. 
As issued U.S. Patents are presumed to meet all of the requirements for patentability, including
35 U.S.C. §§ 101 and 112, first paragraph (see Section VIII(B), below), Appellants submit that the
present polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Appellants understand
that each application is examined on its own merits, Appellants are unaware of any changes to
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal Circuit,
since the issuance of these patents that render the subject matter claimed in these patents, which is similar
to the subject matter in question in the present application, as suddenly non-statutory or failing to meet the
requirements of 35 U.S.C. § 101. Thus, holding Appellants to a different standard of utility would be
arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Appellants submit that the rejection of claims 1-3 under 
35 U.S.C. § 101 must be overruled. 

B. Are Claims 1-3 Unusable Due to a Lack of Patentable Utility? 

The Final Action next rejects claims 1-3 under 35 U.S.C. § 112, first paragraph, since allegedly
one skilled in the art would not know how to use the invention, as the invention allegedly is not supported 
by either a clear asserted utility or a well-established utility. 

The arguments detailed above in Section VIII(A) concerning the utility of the presently claimed
sequences are incorporated herein by reference. As the Federal Circuit and its predecessor have
determined that the utility requirement of Section 101 and the how to use requirement of Section 112, first
paragraph, have the same basis, specifically the disclosure of a credible utility (In re Brana, supra; In re
Jolles, 628 F.2d 1322, 1326 n.11, 206 USPQ 885, 889 n.11 (CCPA 1980); In re Fouche, 439 F.2d
1237, 1243, 169 USPQ 429, 434 (CCPA 1971)), Appellants submit that as claims 1-3 have been shown
to have "a specific, substantial, and credible utility", as detailed in Section VIII(A) above, the present
rejection of claims 1-3 under 35 U.S.C. § 112, first paragraph, cannot stand. 

Appellants therefore submit that the rejection of claims 1-3 under 35 U.S.C. § 112, first paragraph, 
must be overruled. 



IX. APPENDIX 

The claims involved in this appeal are as follows: 

1. (Amended) An isolated expression vector comprising the nucleotide sequence of SEQ ID NO:8.

2. (Amended) An isolated expression vector comprising a nucleic acid sequence encoding the 
amino acid sequence of SEQ ID NO:9. 

3. A host cell comprising the recombinant expression vector of claim 1 or 2. 

X. CONCLUSION

Appellants respectfully submit that, in light of the foregoing arguments, the Final Action's conclusion 
that claims 1-3 lack a patentable utility and are unusable by the skilled artisan due to a lack of patentable 
utility is unwarranted. It is therefore requested that the Board overturn the Final Action's rejections. 

Respectfully submitted, 



David W. Hibler Reg. No. 41,071 

Agent For Appellants 

LEXICON GENETICS INCORPORATED 
8800 Technology Forest Place 
The Woodlands, TX 77381 
(281) 863-3399 

ARRAY OF OLIGONUCLEOTIDES ON A SOLID SUBSTRATE

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Rule 60 Division of U.S. application Ser. No. 850,356, filed Mar. 12, 1992,
which is a Rule 60 Division of U.S. application Ser. No. 492,462, filed Mar. 7, 1990, now U.S. Pat.
No. 5,143,854, which is a Continuation-in-Part of U.S. application Ser. No. 362,901, filed Jun. 7,
1989, now abandoned, all assigned to the assignee of the present invention.

The file of this patent contains drawings executed in color. Copies of this patent with color
drawings will be provided by the Patent and Trademark Office upon request and payment of the
necessary fee.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright
protection. The copyright owner has no objection to the facsimile reproduction by anyone of the
patent document or the patent disclosure as it appears in the Patent and Trademark Office patent
file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

The present inventions relate to the synthesis and placement of materials at known locations. In
particular, one embodiment of the inventions provides a method and associated apparatus for
preparing diverse chemical sequences at known locations on a single substrate surface. The
inventions may be applied, for example, in the field of preparation of oligomer, peptide, nucleic
acid, oligosaccharide, phospholipid, polymer, or drug congener preparation, especially to create
sources of chemical diversity for use in screening for biological activity.

The relationship between structure and activity of molecules is a fundamental issue in the study of
biological systems. Structure-activity relationships are important in understanding, for example,
the function of enzymes, the ways in which cells communicate with each other, as well as cellular
control and feedback systems.

Certain macromolecules are known to interact and bind to other molecules having a very specific
three-dimensional spatial and electronic distribution. Any large molecule having such specificity
can be considered a receptor, whether it is an enzyme catalyzing hydrolysis of a metabolic
intermediate, a cell-surface protein mediating membrane transport of ions, a glycoprotein serving
to identify a particular cell to its neighbors, an IgG-class antibody circulating in the plasma, an
oligonucleotide sequence of DNA in the nucleus, or the like. The various molecules which receptors
selectively bind are known as ligands.

Many assays are available for measuring the binding affinity of known receptors and ligands, but
the information which can be gained from such experiments is often limited by the number and type
of ligands which are available. Novel ligands are sometimes discovered by chance or by application
of new techniques for the elucidation of molecular structure, including x-ray crystallographic
analysis and recombinant genetic techniques for proteins.

Small peptides are an exemplary system for exploring the relationship between structure and
function in biology. A peptide is a sequence of amino acids. When the twenty naturally occurring
amino acids are condensed into polymeric molecules they form a wide variety of three-dimensional
configurations, each resulting from a particular amino acid sequence and solvent condition. The
number of possible pentapeptides of the 20 naturally occurring amino acids, for example, is 20^5 or
3.2 million different peptides. The likelihood that molecules of this size might be useful in
receptor-binding studies is supported by epitope analysis studies showing that some antibodies
recognize sequences as short as a few amino acids with high specificity. Furthermore, the average
molecular weight of amino acids puts small peptides in the size range of many currently useful
pharmaceutical products.

Pharmaceutical drug discovery is one type of research which relies on such a study of
structure-activity relationships. In most cases, contemporary pharmaceutical research can be
described as the process of discovering novel ligands with desirable patterns of specificity for
biologically important receptors. Another example is research to discover new compounds for use in
agriculture, such as pesticides and herbicides.

Sometimes, the solution to a rational process of designing ligands is difficult or unyielding.
Prior methods of preparing large numbers of different polymers have been painstakingly slow when
used at a scale sufficient to permit effective rational or random screening. For example, the
"Merrifield" method (J. Am. Chem. Soc. (1963) 85:2149-2154, which is incorporated herein by
reference for all purposes) has been used to synthesize peptides on a solid support. In the
Merrifield method, an amino acid is covalently bonded to a support made of an insoluble polymer.
Another amino acid with an alpha protected group is reacted with the covalently bonded amino acid
to form a dipeptide. After washing, the protective group is removed and a third amino acid with an
alpha protective group is added to the dipeptide. This process is continued until a peptide of a
desired length and sequence is obtained. Using the Merrifield method, it is not economically
practical to synthesize more than a handful of peptide sequences in a day.

To synthesize larger numbers of polymer sequences, it has also been proposed to use a series of
reaction vessels for polymer synthesis. For example, a tubular reactor system may be used to
synthesize a linear polymer on a solid phase support by automated sequential addition of reagents.
This method still does not enable the synthesis of a sufficiently large number of polymer
sequences for effective economical screening.

Methods of preparing a plurality of polymer sequences are also known in which a porous container
encloses a known quantity of reactive particles, the particles being larger in size than pores of
the container. The containers may be selectively reacted with desired materials to synthesize
desired sequences of product molecules. As with other methods known in the art, this method cannot
practically be used to synthesize a sufficient variety of polypeptides for effective screening.

Other techniques have also been described. These methods include the synthesis of peptides on 96
plastic pins which fit the format of standard microtiter plates. Unfortunately, while these
techniques have been somewhat useful, substantial problems remain. For example, these methods
continue to be limited in the diversity of sequences which can be economically synthesized and
screened.
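
The pentapeptide count quoted in the background (20^5, or 3.2 million) can be checked with a few lines of Python. This is an illustrative sketch only: the one-letter amino acid alphabet and the function names are assumptions introduced here and do not appear in the specification.

```python
from itertools import product

# One-letter codes for the 20 naturally occurring amino acids (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def count_peptides(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct linear peptides of the given length."""
    return alphabet_size ** length

def enumerate_peptides(length, alphabet=AMINO_ACIDS):
    """Lazily yield every peptide of the given length as a string."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)

pentapeptides = count_peptides(5)                 # 20**5
dipeptides = sum(1 for _ in enumerate_peptides(2))  # 20**2, counted by enumeration
```

The count grows by a factor of 20 with each added residue, which is why methods that build only a handful of sequences per day cover a negligible fraction of even short peptides.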

From the above, it is seen that an improved method and apparatus for synthesizing a variety of
chemical sequences at known locations is desired.

SUMMARY OF THE INVENTION

An improved method and apparatus for the preparation of a variety of polymers is disclosed.

In one preferred embodiment, linker molecules are provided on a substrate. A terminal end of the
linker molecules is provided with a reactive functional group protected with a photoremovable
protective group. Using lithographic methods, the photoremovable protective group is exposed to
light and removed from the linker molecules in first selected regions. The substrate is then washed
or otherwise contacted with a first monomer that reacts with exposed functional groups on the
linker molecules. In a preferred embodiment, the monomer is an amino acid containing a
photoremovable protective group at its amino or carboxy terminus and the linker molecule terminates
in an amino or carboxy acid group bearing a photoremovable protective group.

A second set of selected regions is, thereafter, exposed to light and the photoremovable protective
group on the linker molecule/protected amino acid is removed at the second set of regions. The
substrate is then contacted with a second monomer containing a photoremovable protective group for
reaction with exposed functional groups. This process is repeated to selectively apply monomers
until polymers of a desired length and desired chemical sequence are obtained. Photolabile groups
are then optionally removed and the sequence is, thereafter, optionally capped. Side chain
protective groups, if present, are also removed.

By using the lithographic techniques disclosed herein, it is possible to direct light to relatively
small and precisely known locations on the substrate. It is, therefore, possible to synthesize
polymers of a known chemical sequence at known locations on the substrate.

The resulting substrate will have a variety of uses including, for example, screening large numbers
of polymers for biological activity. To screen for biological activity, the substrate is exposed to
one or more receptors such as antibodies, whole cells, receptors on vesicles, lipids, or any one of
a variety of other receptors. The receptors are preferably labeled with, for example, a fluorescent
marker, radioactive marker, or a labeled antibody reactive with the receptor. The location of the
marker on the substrate is detected with, for example, photon detection or autoradiographic
techniques. Through knowledge of the sequence of the material at the location where binding is
detected, it is possible to quickly determine which sequence binds with the receptor and,
therefore, the technique can be used to screen large numbers of peptides. Other possible
applications of the inventions herein include diagnostics in which various antibodies for
particular receptors would be placed on a substrate and, for example, blood sera would be screened
for immune deficiencies. Still further applications include, for example, selective "doping" of
organic materials in semiconductor devices, and the like.

In connection with one aspect of the invention an improved reactor system for synthesizing polymers
is also disclosed. The reactor system includes a substrate mount which engages a substrate around a
periphery thereof. The substrate mount provides for a reactor space between the substrate and the
mount through or into which reaction fluids are pumped or flowed. A mask is placed on or focused on
the substrate and illuminated so as to deprotect selected regions of the substrate in the reactor
space. A monomer is pumped through the reactor space or otherwise contacted with the substrate and
reacts with the deprotected regions. By selectively deprotecting regions on the substrate and
flowing predetermined monomers through the reactor space, desired polymers at known locations may
be synthesized.

Improved detection apparatus and methods are also disclosed. The detection method and apparatus
utilize a substrate having a large variety of polymer sequences at known locations on a surface
thereof. The substrate is exposed to a fluorescently labeled receptor which binds to one or more of
the polymer sequences. The substrate is placed in a microscope detection apparatus for
identification of locations where binding takes place. The microscope detection apparatus includes
a monochromatic or polychromatic light source for directing light at the substrate, means for
detecting fluoresced light from the substrate, and means for determining a location of the
fluoresced light. The means for detecting light fluoresced on the substrate may in some embodiments
include a photon counter. The means for determining a location of the fluoresced light may include
an x/y translation table for the substrate. Translation of the slide and data collection are
recorded and managed by an appropriately programmed digital computer.

A further understanding of the nature and advantages of the inventions herein may be realized by
reference to the remaining portions of the specification and the attached drawings.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates masking and irradiation of a substrate at a first location. The substrate is
shown in cross-section;

FIG. 2 illustrates the substrate after application of a monomer "A";

FIG. 3 illustrates irradiation of the substrate at a second location;

FIG. 4 illustrates the substrate after application of monomer "B";

FIG. 5 illustrates irradiation of the "A" monomer;

FIG. 6 illustrates the substrate after a second application of "B";

FIG. 7 illustrates a completed substrate;

FIGS. 8A and 8B illustrate alternative embodiments of a reactor system for forming a plurality of
polymers on a substrate;

FIG. 9 illustrates a detection apparatus for locating fluorescent markers on the substrate;

FIGS. 10A-10M illustrate the method as it is applied to the production of the trimers of monomers
"A" and "B";

FIGS. 11A and 11B are fluorescence traces for standard fluorescent beads;

FIGS. 12A and 12B are fluorescence curves for NVOC (6-nitroveratryloxycarbonyl) slides not exposed
and exposed to light respectively;

FIGS. 13A to 13D are fluorescence plots of slides exposed through 100 μm, 30 μm, 20 μm, and 10 μm
masks;

FIGS. 14A and 14B illustrate formation of YGGFL (a peptide of sequence
H2N-tyrosine-glycine-glycine-phenylalanine-leucine-CO2H) and GGFL (a peptide of sequence
H2N-glycine-glycine-phenylalanine-leucine-CO2H), followed by exposure to labeled Herz antibody
(an antibody that recognizes YGGFL but not GGFL);

FIGS. 15A and 15B are fluorescence plots of a slide with
a checkerboard pattern of YGGFL and GGFL exposed
to labeled Herz antibody; FIG. 15A illustrates a
500×500 μm mask which has been focused on the substrate
according to FIG. 8A while FIG. 15B illustrates
a 50×50 μm mask placed in direct contact with the
substrate in accord with FIG. 8B;

FIG. 16 is a fluorescence plot of YGGFL and
PGGFL synthesized in a 50 μm checkerboard pattern;

FIG. 17 is a fluorescence plot of YPGGFL and
YGGFL synthesized in a 50 μm checkerboard pattern;

FIGS. 18A and 18B illustrate the mapping of sixteen
sequences synthesized on two different glass slides;

FIG. 19 is a fluorescence plot of the slide illustrated
in FIG. 18A; and

FIG. 20 is a fluorescence plot of the slide illustrated
in FIG. 18B.
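
The light-directed synthesis summarized above, in which a mask selects the regions to be deprotected and a protected monomer is then coupled only at those regions, can be sketched as a small simulation. The Python below is a minimal illustration under stated assumptions: the function names, the region indexing, and the mask schedule are hypothetical choices made for this sketch and are not taken from the specification or from FIGS. 10A-10M.

```python
from itertools import product

def synthesize(n_regions, cycles):
    """Simulate masked coupling cycles on a substrate with n_regions regions.

    cycles is a list of (mask, monomer) pairs. mask is the set of region
    indices illuminated (deprotected) before the monomer is applied.
    """
    substrate = [[] for _ in range(n_regions)]  # every region starts as a bare linker
    for mask, monomer in cycles:
        for region in sorted(mask):
            # Only deprotected regions react; all other regions stay protected.
            substrate[region].append(monomer)
    return ["".join(chain) for chain in substrate]

def trimer_schedule(monomers="AB", length=3):
    """Build one (mask, monomer) cycle per chain position and monomer.

    This schedule is an assumption for illustration, not the patent's own.
    """
    targets = ["".join(p) for p in product(monomers, repeat=length)]
    cycles = []
    for position in range(length):
        for monomer in monomers:
            mask = {i for i, seq in enumerate(targets) if seq[position] == monomer}
            cycles.append((mask, monomer))
    return targets, cycles

targets, cycles = trimer_schedule()
result = synthesize(len(targets), cycles)
```

With this schedule, six masking cycles place all eight trimers of "A" and "B" at known region indices, which is the property that later lets a binding location be mapped back to a sequence.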

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

I. Glossary

The following terms are intended to have the following
general meanings as they are used herein:

1. Complementary: Refers to the topological compati- 
bility or matching together of interacting surfaces of 
a ligand molecule and its receptor. Thus, the receptor 
and its ligand can be described as complementary,
and furthermore, the contact surface characteristics 
are complementary to each other. 

2. Epitope: The portion of an antigen molecule which is 
delineated by the area of interaction with the subclass 
of receptors known as antibodies.

3. Ligand: A ligand is a molecule that is recognized by 
a particular receptor. Examples of ligands that can be 
investigated by this invention include, but are not 
restricted to, agonists and antagonists for cell membrane
receptors, toxins and venoms, viral epitopes,
hormones (e.g., steroids, etc.), hormone receptors,
peptides, enzymes, enzyme substrates, cofactors,
drugs (e.g., opiates, etc.), lectins, sugars, oligonucleotides,
nucleic acids, oligosaccharides, proteins, and
monoclonal antibodies. 

4. Monomer: A member of the set of small molecules
which can be joined together to form a polymer. The 
set of monomers includes but is not restricted to, for 
example, the set of common L-amino acids, the set of 
D-amino acids, the set of synthetic amino acids, the 
set of nucleotides and the set of pentoses and hexoses. 
As used herein, monomers refers to any member of a 
basis set for synthesis of a polymer. For example, 
dimers of L-amino acids form a basis set of 400 monomers
for synthesis of polypeptides. Different basis
sets of monomers may be used at successive steps in 
the synthesis of a polymer. 

5. Peptide: A polymer in which the monomers are alpha 
amino acids and which are joined together through 
amide bonds and alternatively referred to as a poly- 
peptide. In the context of this specification it should 
be appreciated that the amino acids may be the L- 
optical isomer or the D-optical isomer. Peptides are 
more than two amino acid monomers long, and often 
more than 20 amino acid monomers long. Standard 
abbreviations for amino acids are used (e.g., P for 
proline). These abbreviations are included in Stryer, 
Biochemistry, Third Ed., 1988, which is incorporated
herein by reference for all purposes. 

6. Radiation: Energy which may be selectively applied
including energy having a wavelength of between
10^-14 and 10^4 meters including, for example, electron
beam radiation, gamma radiation, x-ray radiation, 
ultraviolet radiation, visible light, infrared radiation, 
microwave radiation, and radio waves. "Irradiation" 
refers to the application of radiation to a surface. 
7. Receptor: A molecule that has an affinity for a given
ligand. Receptors may be naturally-occurring or man-made
molecules. Also, they can be employed in their
unaltered state or as aggregates with other species. 
Receptors may be attached, covalently or noncova- 
lently, to a binding member, either directly or via a 
specific binding substance. Examples of receptors 
which can be employed by this invention include, but 
are not restricted to, antibodies, cell membrane recep- 
tors, monoclonal antibodies and antisera reactive 
with specific antigenic determinants (such as on vi- 
ruses, cells or other materials), drugs, polynucleo- 
tides, nucleic acids, peptides, cofactors, lectins, sug- 
ars, polysaccharides, cells, cellular membranes, and 
organelles. Receptors are sometimes referred to in the 
art as anti-ligands. As the term receptors is used 
herein, no difference in meaning is intended. A "Ligand
Receptor Pair" is formed when two macromolecules
have combined through molecular recognition
to form a complex. 

Other examples of receptors which can be investigated
by this invention include but are not restricted to:
a) Microorganism receptors: Determination of li- 
gands which bind to receptors, such as specific 
transport proteins or enzymes essential to survival 
of microorganisms, is useful in a new class of antibi- 
otics. Of particular value would be antibiotics 
against opportunistic fungi, protozoa, and those 
bacteria resistant to the antibiotics in current use. 
b) Enzymes: For instance, the binding site of enzymes such as the enzymes responsible for cleaving
neurotransmitters; determination of ligands which bind to certain receptors to modulate the action
of the enzymes which cleave the different neurotransmitters is useful in the development of drugs
which can be used in the treatment of disorders of neurotransmission.

c) Antibodies: For instance, the invention may be useful in investigating the ligand-binding site
on the antibody molecule which combines with the epitope of an antigen of interest; determining a
sequence that mimics an antigenic epitope may lead to the development of vaccines of which the
immunogen is based on one or more of such sequences or lead to the development of related
diagnostic agents or compounds useful in therapeutic treatments such as for auto-immune diseases
(e.g., by blocking the binding of the "self" antibodies).

d) Nucleic Acids: Sequences of nucleic acids may be synthesized to establish DNA or RNA binding
sequences.

e) Catalytic Polypeptides: Polymers, preferably polypeptides, which are capable of promoting a
chemical reaction involving the conversion of one or more reactants to one or more products. Such
polypeptides generally include a binding site specific for at least one reactant or reaction
intermediate and an active functionality proximate to the binding site, which functionality is
capable of chemically modifying the bound reactant. Catalytic polypeptides are described in, for
example, U.S. application Ser. No. 404,920, which is incorporated herein by reference for all
purposes.

f) Hormone receptors: For instance, the receptors for insulin and growth hormone. Determination of
the ligands which bind with high affinity to a receptor is useful in the development of, for
example, an oral replacement of the daily injections which diabetics must take to relieve the
symptoms of diabetes, and in the other case, a replacement for the scarce human growth hormone
which can only be obtained from cadavers or by recombinant DNA technology. Other examples are the
vasoconstrictive hormone receptors; determination of those ligands which bind to a receptor may
lead to the development of drugs to control blood pressure.

g) Opiate receptors: Determination of ligands which bind to the opiate receptors in the brain is
useful in the development of less-addictive replacements for morphine and related drugs.

8. Substrate: A material having a rigid or semi-rigid surface. In many embodiments, at least one
surface of the substrate will be substantially flat, although in some embodiments it may be
desirable to physically separate synthesis regions for different polymers with, for example, wells,
raised regions, etched trenches, or the like. According to other embodiments, small beads may be
provided on the surface which may be released upon completion of the synthesis.

9. Protective Group: A material which is bound to a monomer unit and which may be spatially removed
upon selective exposure to an activator such as electromagnetic radiation. Examples of protective
groups with utility herein include Nitroveratryloxy carbonyl, Nitrobenzyloxy carbonyl, Dimethyl
dimethoxybenzyloxy carbonyl, 5-Bromo-7-nitroindolinyl, o-Hydroxy-α-methyl cinnamoyl, and
2-Oxymethylene anthraquinone. Other examples of activators include ion beams, electric fields,
magnetic fields, electron beams, x-ray, and the like.

10. Predefined Region: A predefined region is a localized area on a surface which is, was, or is
intended to be activated for formation of a polymer. The predefined region may have any convenient
shape, e.g., circular, rectangular, elliptical, wedge-shaped, etc. For the sake of brevity herein,
"predefined regions" are sometimes referred to simply as "regions."

11. Substantially Pure: A polymer is considered to be "substantially pure" within a predefined
region of a substrate when it exhibits characteristics that distinguish it from other predefined
regions. Typically, purity will be measured in terms of biological activity or function as a result
of uniform sequence. Such characteristics will typically be measured by way of binding with a
selected ligand or receptor.

II. General

The present invention provides methods and apparatus for the preparation and use of a substrate
having a plurality of polymer sequences in predefined regions. The invention is described herein
primarily with regard to the preparation of molecules containing sequences of amino acids, but
could readily be applied in the preparation of other polymers. Such polymers include, for example,
both linear and cyclic polymers of nucleic acids, polysaccharides, phospholipids, and peptides
having either α-, β-, or ω-amino acids, heteropolymers in which a known drug is covalently bound to
any of the above, polyurethanes, polyesters, polycarbonates, polyureas, polyamides,
polyethyleneimines, polyarylene sulfides, polysiloxanes, polyimides, polyacetates, or other
polymers which will be apparent upon review of this disclosure. In a preferred embodiment, the
invention herein is used in the synthesis of peptides.

The prepared substrate may, for example, be used in screening a variety of polymers as ligands for
binding with a receptor, although it will be apparent that the invention could be used for the
synthesis of a receptor for binding with a ligand. The substrate disclosed herein will have a wide
variety of other uses. Merely by way of example, the invention herein can be used in determining
peptide and nucleic acid sequences which bind to proteins, finding sequence-specific binding drugs,
identifying epitopes recognized by antibodies, and evaluation of a variety of drugs for clinical
and diagnostic applications, as well as combinations of the above.

The invention preferably provides for the use of a substrate "S" with a surface. Linker molecules
"L" are optionally provided on a surface of the substrate. The purpose of the linker molecules, in
some embodiments, is to facilitate receptor recognition of the synthesized polymers.

Optionally, the linker molecules may be chemically protected for storage purposes. A chemical
storage protective group such as t-BOC (t-butoxycarbonyl) may be used in some embodiments. Such
chemical protective groups would be chemically removed upon exposure to, for example, acidic
solution and would serve to protect the surface during storage and be removed prior to polymer
preparation.

On the substrate or a distal end of the linker molecules, a functional group with a protective
group P0 is provided. The protective group P0 may be removed upon exposure to radiation, electric
fields, electric currents, or other activators to expose the functional group.

In a preferred embodiment, the radiation is ultraviolet (UV), infrared (IR), or visible light. As
more fully described below, the protective group may alternatively be an electrochemically-sensitive
group which may be removed in the presence of an electric field. In still further alternative
embodiments, ion beams, electron beams, or the like may be used for deprotection.

In some embodiments, the exposed regions and, therefore, the area upon which each distinct polymer
sequence is synthesized are smaller than about 1 cm² or less than 1 mm². In preferred embodiments
the exposed area is less than about 10,000 μm² or, more preferably, less than 100 μm² and may, in
some embodiments, encompass the binding site for as few as a single molecule. Within these regions,
each polymer is preferably synthesized in a substantially pure form.

Concurrently or after exposure of a known region of the substrate to light, the surface is
contacted with a first monomer unit M1 which reacts with the functional group which has been
exposed by the deprotection step. The first monomer includes a protective group P1. P1 may or may
not be the same as P0.

Accordingly, after a first cycle, known first regions of the surface may comprise the sequence:

S-L-M1-P1

while remaining regions of the surface comprise the sequence:

S-L-P0.

Thereafter, second regions of the surface (which may include the first region) are exposed to light
and contacted with a second monomer M2 (which may or may not be the same as M1) having a protective
group P2. P2 may or may not be the same as P0 and P1. After this second cycle, different regions of
the substrate may comprise one or more of the following sequences:

S-L-M1-M2-P2
S-L-M2-P2
S-L-M1-P1 and/or
S-L-P0.

The above process is repeated until the substrate includes desired polymers of desired lengths. By
controlling the locations of the substrate exposed to light and the reagents exposed to the
substrate following exposure, the location of each sequence will be known.

Thereafter, the protective groups are removed from some or all of the substrate and the sequences
are, op-

followed by contacting with M1-P, resulting in the sequence S-M1-P at the first location. The
second locations would then be irradiated and contacted with M4-P, resulting in the sequence S-M4-P
at the second locations. Thereafter both the first and second locations would be irradiated and
contacted with the dimer M2-M3, resulting in the sequence S-M1-M2-M3 at the first locations and
S-M4-M2-M3 at the second locations. Of course, common subsequences of any length could be utilized
including those in a range of 2 or more monomers, 2 to 100 monomers, 2 to 20 monomers, and a most
preferred range of 2 to 3 monomers.

According to other embodiments, a set of masks is used for the first monomer layer and, thereafter,
varied light wavelengths are used for selective deprotection. For example, in the process discussed
above, first regions are first exposed through a mask and reacted with a first monomer having a
first protective group P1, which is removable upon exposure to a first wavelength of light (e.g.,
IR). Second regions are masked and reacted with a second monomer having a second protective group
P2, which is removable upon exposure to a second wavelength of light (e.g., UV). Thereafter, masks
become unnecessary in the synthesis because the entire substrate may be exposed alternatively to
the first and second wavelengths of light in the deprotection cycle.

The polymers prepared on a substrate according to the above methods will have a variety of uses
including, for example, screening for biological activity. In such screening activities, the
substrate containing the sequences is exposed to an unlabeled or labeled receptor such as an
antibody, receptor on a cell, phospholipid vesicle, or any one of a variety of other receptors. In
one preferred embodiment the polymers are exposed to a first, unlabeled receptor of interest and,
thereafter, exposed to a labeled receptor-specific recognition element, which is, for example, an
antibody. This process will provide signal amplification in the detection stage.

The receptor molecules may bind with one or more polymers on the substrate. The presence of the
labeled receptor and, therefore, the presence of a sequence which binds with the receptor is
detected in a preferred embodiment through the use of autoradiography, detection of fluorescence
with a charge-coupled device, fluorescence microscopy, or the like. The sequence of the polymer at
the locations where the receptor binding is detected may be used to determine all or part of a
sequence which is complementary to the receptor.

Use of the invention herein is illustrated primarily
tionally, capped with a capping unit C. The process reference to screening for biological activity. The 

results in a substrate having a surface with a plurality of invention will, however, find many other uses. For 
polymers of the following general formula: example, the invention may be used in information stor- 

55 age (eg., on optical disks), production of molecular 
S-[LHM^(MjXMa) • - • (MxHCl electronic devices, production of stationary phases in 

separation sciences, production of dyes and brightening 
where square brackets indicate optional groups, and M/ agents, photography, and in immobilization of cells, 
. . . Mr indicates any sequence of monomers. The num- proteins, lectins, nucleic acids, polysaccharides and the 
bcr of monomers could cover a wide variety of values, 60 like in patterns on a surface via molecular recognition of 
but in a preferred embodiment they will range from 2 to specific polymer sequences. By synthesizing the same 
100. compound in adjacent, progressively differing concen- 

In some embodiments a plurality of locations on the tra tions, a gradient will be established to control chemo- 
substrate polymers are to contain a common monomer taxis or to develop diagnostic dipsticks which, for ex- 
subsequence. For example, it may be desired to synthe- 65 ample, titrate an antibody against an increasing amount 
size a sequence S-Mi-Mz-Mj at first locations and a of antigen. By synthesizing several catalyst molecules in 
sequence S-Mi-Mz-Mj at second locations. The process close proximity, more efficient multistep conversions 
would commence with irradiation of the first locations may be achieved by "coordinate immobilization." Co- 
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ordinate immobilization also may be used for electron completed substrate to interact freely with molecules 

transfer systems, as well as to provide both structural exposed to the substrate. The KnV*>r molecules should 

integrity and other desirable properties to materials be 6-50 atoms long to provide sufficient exposure. The 

such as lubrication, wetting, etc. linker molecules may be, for example, aryl acetylene, 

According to alternative embodiments, molecular 5 ethylene glycol oligomers containing 2-10 monomer 

biodistribution or pharmacokinetic properties may be units, diamines, diacids, amino acids, or combinations 

examined. For example, to assess resistance to intestinal thereof. Other linker molecules may be used in light of 

or serum proteases, polymers may be capped with a this disclsoure. 

fluorescent tag and exposed to biological fluids of inter- According to alternative embodiments, the linker 
S?^* 10 molecules are selected based upon their hydrophilicA 

ill. TOlymer Synthesis hydrophobic properties to improve presentation of syn- 

FIG. 1 fllustrates one embodnnent of the invention thesized polymers to certain receptors. For example, in 

disclosed herem in which a substrate 2 » shown in ^ ^ of a hydropmlic reccpXO t, hydrophilic linker 

cross-secbon. Essentially, any conceivable substrate molecules will be preferred so as to permit the receptor 
Z i^7 P \° y £ ^e^enaon. ™e substrate may 15 l0 m0re closely approach the synthesized polymer 

^ m So ' f T^' ° tg ^Z morsamc - .°f a According to another alternative embodmie^ linker 

£Z£^£E y .£ Si 'ZT* 1 part,dCS ' molecules L also provided with a photocleavable 

strands, preapitates, gels sheet* tubmg, spheres, con- at ^ intermediate position. The photocleavable 

tamers, capillaries, pads, slices, films, plates, slides, etc • f \V i V?; 

The substrate may shape, such as a 20 f"* " P deaw *** a w ^elength different 

disc, square, sphere, circle, etc. The substrate is prefera- *™ * C P^tective f oup. 1ms enables removal of the 

bly flat but may take on a variety of alternative surface foIJo ^! w J?P let1011 of *5 

configurations. For example, the substrate may contain *\™ y ° f "P 0SUrC to wavelengths of 

raised or depressed regions on which the synthesis takes ' *w ,* , , , , _ v „ 
place. The substrate and its surface preferably form a 25 . The ^ molecules can be attached to the substrate 
rigid support on which to carry out the reactions de- via carbon-carbon bonds using, for example, (poly)tri- 
scribed herein. The substrate and its surface is also Auorochloroethylene surfaces, or preferably, by sOox- 
chosen to provide appropriate light-absorbing charac- (usmg ' for exam P le - glass or silicon oxide 
teristics. For instance, the substrate may be a polymer- surfaces )' Siloxane bonds with the surface of the sub- 
Lzed Langmuir Blodgett film, functionalized glass, Si, 30 strate ^ formed in one embodiment via reactions 
Ge, GaAs, GaP, SiCh, SIN4, modified silicon, or any ot tinker molecules bearing trichlorosilyl groups. The 
one of a wide variety of gels or polymers such as (poly)- linker molecules may optionally be attached in an or- 
tetrafluoroethylene, (poly)vinylidenedifluoride, poly- derc # d am V» i-^K parts of the head groups in a poly- 
styrene, polycarbonate, or combinations thereof. Other merized Langmuir Blodgett film. In alternative embods- 
substrate materials will be readily apparent to those of 35 ments * ^ e linker molecules are adsorbed to the surface 
skill in the art upon review of this disclosure. In a pre- °^ ^ e substrate. 

ferred embodiment the substrate is flat glass or single- The linker molecules and monomers used herein are 

crystal silicon with surface relief features of less than 10 provided with a functional group to which is bound a 

A. protective group. Preferably, the protective group is on 

According to some embodiments, the surface of the 40 tne distal or terminal end of the linker molecule oppo- 

substrate is etched using well known techniques to pro- * tc tDe substrate. The protective group may be either a 

vide for desired surface features. For example, by way negative protective group (le., the protective group 

of the formation of trenches, v-grcoves, mesa struc- renders the linker molecules less reactive with a mono- 

tures, or the like, the synthesis regions may be more mer upon exposure) or a positive protective group (i.e^ 

closely placed within the focus point of impinging light, 45 the protective group renders the linker molecules more 

be provided with reflective "mirror" structures for reactive with a monomer upon exposure). In the case of 

maximization of light collection from fluorescent negative protective groups an additional step of reacti- 

sources, or the like. vation will be required. In some embodiments, this will 

Surfaces on the solid substrate will usually, though be done by heating, 

not always, be composed of the same material as the 50 The protective group on the linker molecules may be 

substrate. Thus, the surface may be composed of any of selected from a wide variety of positive light-reactive 

a wide variety of materials, for example, polymers, groups preferably including nitro aromatic compounds 

plastics, resins, polysaccharides, silica or silica-based such as o-nitrobenzyl derivatives or benzylsulfonyL In a 

materials, carbon, metals, inorganic glasses, membranes, preferred embodiment, 6-nitrovcratryloxycarbonyl 

or any of the above-listed substrate materials. In some 55 (NVOC), 2-nitrobenzyloxycarbonyl (NBOQ or a,o 

embodiments the surface may provide for the use of dimethyl^iimethoxybeiizyloxycarbonyl (DDZ) is used, 

caged binding members which are attached firmly to In one embodiment; a nitro aromatic compound con- 

the surface of the substrate. Preferably, the surface will taking a benzylic hydrogen ortho to the nitro group is 

contain reactive groups, which could be carboxyl, used, Le., a chemical of the form: 
amino, hydroxyl, or the like. Most preferably, the sur- 60 
face will be optically transparent and will have surface 
Si — OH functionalities, such as are found on silica sur- 
faces. 

The surface 4 of the substrate is preferably provided 
with a layer of linker molecules 6, although it will be 65 
understood that the linker molecules are not required 
elements of the invention. The linker molecules are 
preferably of sufficient length to permit polymers in a 
where R1 is alkoxy, alkyl, halo, aryl, alkenyl, or hydrogen; R2 is alkoxy, alkyl, halo, aryl, nitro,
or hydrogen; R3 is alkoxy, alkyl, halo, nitro, aryl, or hydrogen; R4 is alkoxy, alkyl, hydrogen,
aryl, halo, or nitro; and R5 is alkyl, alkynyl, cyano, alkoxy, hydrogen, halo, aryl, or alkenyl.
Other materials which may be used include o-hydroxy-α-methyl cinnamoyl derivatives. Photoremovable
protective groups are described in, for example, Patchornik, J. Am. Chem. Soc. (1970) 92:6333 and
Amit et al., J. Org. Chem. (1974) 39:192, both of which are incorporated herein by reference.

In an alternative embodiment the positive reactive group is activated for reaction with reagents in
solution. For example, a 5-bromo-7-nitro indoline group, when bound to a carbonyl, undergoes reaction
upon exposure to light at 420 nm.

In a second alternative embodiment, the reactive group on the linker molecule is selected from a wide
variety of negative light-reactive groups including a cinnamate group.

Alternatively, the reactive group is activated or deactivated by electron beam lithography, x-ray
lithography, or any other radiation. Suitable reactive groups for electron beam lithography include
sulfonyl. Other methods may be used including, for example, exposure to a current source. Other
reactive groups and methods of activation may be used in light of this disclosure.

As shown in FIG. 1, the linking molecules are preferably exposed to, for example, light through a
suitable mask 8 using photolithographic techniques of the type known in the semiconductor industry
and described in, for example, Sze, VLSI Technology, McGraw-Hill (1983), and Mead et al.,
Introduction to VLSI Systems, Addison-Wesley (1980), which are incorporated herein by reference for
all purposes. The light may be directed at either the surface containing the protective groups or at
the back of the substrate, so long as the substrate is transparent to the wavelength of light needed
for removal of the protective groups. In the embodiment shown in FIG. 1, light is directed at the
surface of the substrate containing the protective groups. FIG. 1 illustrates the use of such masking
techniques as they are applied to a positive reactive group so as to activate linking molecules and
expose functional groups in areas 10a and 10b.

The mask 8 is in one embodiment a transparent support material selectively coated with a layer of
opaque material. Portions of the opaque material are removed, leaving opaque material in the precise
pattern desired on the substrate surface. The mask is brought into close proximity with, imaged on,
or brought directly into contact with the substrate surface as shown in FIG. 1. "Openings" in the
mask correspond to locations on the substrate where it is desired to remove photoremovable protective
groups from the substrate. Alignment may be performed using conventional alignment techniques in
which alignment marks (not shown) are used to accurately overlay successive masks with previous
patterning steps, or more sophisticated techniques may be used. For example, interferometric
techniques such as the one described in Flanders et al., "A New Interferometric Alignment Technique,"
App. Phys. Lett. (1977) 31:426-428, which is incorporated herein by reference, may be used.

To enhance contrast of light applied to the substrate, it is desirable to provide contrast
enhancement materials between the mask and the substrate according to some embodiments. This contrast
enhancement layer may comprise a molecule which is decomposed by light such as quinone diazide or a
material which is transiently bleached at the wavelength of interest. Transient bleaching of
materials will allow greater penetration where light is applied, thereby enhancing contrast.
Alternatively, contrast enhancement may be provided by way of a cladded fiber optic bundle.

The light may be from a conventional incandescent source, a laser, a laser diode, or the like. If
non-collimated sources of light are used it may be desirable to provide a thick- or multi-layered
mask to prevent spreading of the light onto the substrate. It may, further, be desirable in some
embodiments to utilize groups which are sensitive to different wavelengths to control synthesis. For
example, by using groups which are sensitive to different wavelengths, it is possible to select
branch positions in the synthesis of a polymer or eliminate certain masking steps. Several reactive
groups along with their corresponding wavelengths for deprotection are provided in Table 1.

TABLE 1

Group                                        Approximate Deprotection Wavelength
Nitroveratryloxy carbonyl (NVOC)             UV (300-400 nm)
Nitrobenzyloxy carbonyl (NBOC)               UV (300-350 nm)
Dimethyl dimethoxybenzyloxy carbonyl         UV (280-300 nm)
5-Bromo-7-nitroindolinyl                     UV (420 nm)
o-Hydroxy-α-methyl cinnamoyl                 UV (300-350 nm)
2-Oxymethylene anthraquinone                 UV (350 nm)

While the invention is illustrated primarily herein by way of the use of a mask to illuminate
selected regions of the substrate, other techniques may also be used. For example, the substrate may
be translated under a modulated laser or diode light source. Such techniques are discussed in, for
example, U.S. Pat. No. 4,719,615 (Feyrer et al.), which is incorporated herein by reference. In
alternative embodiments a laser galvanometric scanner is utilized. In other embodiments, the
synthesis may take place on or in contact with a conventional liquid crystal (referred to herein as a
"light valve") or fiber optic light sources. By appropriately modulating liquid crystals, light may
be selectively controlled so as to permit light to contact selected regions of the substrate.
Alternatively, synthesis may take place on the end of a series of optical fibers to which light is
selectively applied. Other means of controlling the location of light exposure will be apparent to
those of skill in the art.

The substrate may be irradiated either in contact or not in contact with a solution (not shown) and
is, preferably, irradiated in contact with a solution. The solution contains reagents to prevent the
by-products formed by irradiation from interfering with synthesis of the polymer according to some
embodiments. Such by-products might include, for example, carbon dioxide, nitrosocarbonyl compounds,
styrene derivatives, indole derivatives, and products of their photochemical reactions.
Alternatively, the solution may contain reagents used to match the index of refraction of the
substrate. Reagents added to the solution may further include, for example, acidic or basic buffers,
thiols, substituted hydrazines and hydroxylamines, reducing agents (e.g., NADH) or reagents known to
react with a given functional group (e.g., aryl nitroso + glyoxylic acid → aryl formhydroxamate +
CO2).
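The idea of choosing protective groups that respond to different wavelengths can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the windows are the nominal values of three Table 1 rows (NBOC, DDZ, and 5-bromo-7-nitroindolinyl), the helper names are hypothetical, and a non-overlap test on nominal windows is a crude stand-in for a comparison of real absorption spectra.

```python
# Approximate deprotection windows in nm, taken from three rows of Table 1.
# A single-wavelength entry is stored as a zero-width window.
WINDOWS_NM = {
    "NBOC": (300, 350),
    "DDZ": (280, 300),
    "5-bromo-7-nitroindolinyl": (420, 420),
}


def windows_overlap(a, b):
    """Return True if two closed (low, high) windows share any wavelength."""
    return a[0] <= b[1] and b[0] <= a[1]


def separable_pairs(windows):
    """Return sorted pairs of group names whose nominal windows do not overlap."""
    names = sorted(windows)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if not windows_overlap(windows[x], windows[y])
    ]


if __name__ == "__main__":
    # Pairs that could, on nominal values alone, be deprotected selectively.
    for pair in separable_pairs(WINDOWS_NM):
        print(pair)
```

On these nominal values the 420 nm group is separable from both of the others, while NBOC and DDZ touch at 300 nm and so are not reported as a separable pair.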
Either concurrently with or after the irradiation step, the linker molecules are washed or otherwise
contacted with a first monomer, illustrated by "A" in regions 12a and 12b in FIG. 2. The first
monomer reacts with the activated functional groups of the linkage molecules which have been exposed
to light. The first monomer, which is preferably an amino acid, is also provided with a
photoprotective group. The photoprotective group on the monomer may be the same as or different than
the protective group used in the linkage molecules, and may be selected from any of the
above-described protective groups. In one embodiment, the protective group for the A monomer is
selected from the group NBOC and NVOC.

As shown in FIG. 3, the process of irradiating is thereafter repeated, with a mask repositioned so as
to remove linkage protective groups and expose functional groups in regions 14a and 14b which are
illustrated as being regions which were protected in the previous masking step. As an alternative to
repositioning of the first mask, in many embodiments a second mask will be utilized. In other
alternative embodiments, some steps may provide for illuminating a common region in successive steps.
As shown in FIG. 3, it may be desirable to provide separation between irradiated regions. For
example, separation of about 1-5 μm may be appropriate to account for alignment tolerances.

As shown in FIG. 4, the substrate is then exposed to a second protected monomer "B," producing B
regions 16a and 16b. Thereafter, the substrate is again masked so as to remove the protective groups
and expose reactive groups on A region 12a and B region 16b. The substrate is again exposed to
monomer B, resulting in the formation of the structure shown in FIG. 6. The dimers B-A and B-B have
been produced on the substrate.

A subsequent series of masking and contacting steps similar to those described above with A (not
shown) provides the structure shown in FIG. 7. The process provides all possible dimers of B and A,
i.e., B-A, A-B, A-A, and B-B.

The substrate, the area of synthesis, and the area for synthesis of each individual polymer could be
of any size or shape. For example, squares, ellipsoids, rectangles, triangles, circles, or portions
thereof, along with irregular geometric shapes, may be utilized. Duplicate synthesis areas may also
be applied to a single substrate for purposes of redundancy.

In one embodiment the regions 12a, 12b and 16a, 16b on the substrate will have a surface area of
between about 1 cm² and 10⁻¹⁰ cm². In some embodiments the regions 12a, 12b and 16a, 16b have areas
of less than about 10⁻¹ cm², 10⁻² cm², 10⁻³ cm², 10⁻⁴ cm², 10⁻⁵ cm², 10⁻⁶ cm², 10⁻⁷ cm², 10⁻⁸ cm²,
or 10⁻¹⁰ cm². In a preferred embodiment, the regions 12a, 12b and 16a, 16b are between about
10×10 μm and 500×500 μm.

In some embodiments a single substrate supports more than about 10 different monomer sequences and
preferably more than about 100 different monomer sequences, although in some embodiments more than
about 10³, 10⁴, 10⁵, 10⁶, 10⁷, or 10⁸ different sequences are provided on a substrate. Of course,
within a region of the substrate in which a monomer sequence is synthesized, it is preferred that the
monomer sequence be substantially pure. In some embodiments, regions of the substrate contain polymer
sequences which are at least about 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%,
80%, 90%, 95%, 96%, 97%, 98%, or 99% pure.

According to some embodiments, several sequences are intentionally provided within a single region so
as to provide an initial screening for biological activity, after which materials within regions
exhibiting significant binding are further evaluated.

IV. Details of One Embodiment of a Reactor System

FIG. 8A schematically illustrates a preferred embodiment of a reactor system 100 for synthesizing
polymers on the prepared substrate in accordance with one aspect of the invention. The reactor system
includes a body 102 with a cavity 104 on a surface thereof. In preferred embodiments the cavity 104
is between about 50 and 1000 μm deep with a depth of about 500 μm preferred.

The bottom of the cavity is preferably provided with an array of ridges 106 which extend both into
the plane of the Figure and parallel to the plane of the Figure. The ridges are preferably about 50
to 200 μm deep and spaced at about 2 to 3 mm. The purpose of the ridges is to generate turbulent flow
for better mixing. The bottom surface of the cavity is preferably light absorbing so as to prevent
reflection of impinging light.

A substrate 112 is mounted above the cavity 104. The substrate is provided along its bottom surface
114 with a photoremovable protective group such as NVOC with or without an intervening linker
molecule. The substrate is preferably transparent to a wide spectrum of light, but in some
embodiments is transparent only at a wavelength at which the protective group may be removed (such as
UV in the case of NVOC). The substrate in some embodiments is a conventional microscope glass slide
or cover slip. The substrate is preferably as thin as possible, while still providing adequate
physical support. Preferably, the substrate is less than about 1 mm thick, more preferably less than
0.5 mm thick, more preferably less than 0.1 mm thick, and most preferably less than 0.05 mm thick. In
alternative preferred embodiments, the substrate is quartz or silicon.

The substrate and the body serve to seal the cavity except for an inlet port 108 and an outlet port
110. The body and the substrate may be mated for sealing in some embodiments with one or more
gaskets. According to a preferred embodiment, the body is provided with two concentric gaskets and
the intervening space is held at vacuum to ensure mating of the substrate to the gaskets.

Fluid is pumped through the inlet port into the cavity by way of a pump 116 which may be, for
example, a model no. B-120-S made by Eldex Laboratories. Selected fluids are circulated into the
cavity by the pump, through the cavity, and out the outlet for recirculation or disposal. The reactor
may be subjected to ultrasonic radiation and/or heated to aid in agitation in some embodiments.

Above the substrate 112, a lens 120 is provided which may be, for example, a 2" 100 mm focal length
fused silica lens. For the sake of a compact system, a reflective mirror 122 may be provided for
directing light from a light source 124 onto the substrate. Light source 124 may be, for example, a
Xe(Hg) light source manufactured by Oriel and having model no. 66024. A second lens 126 may be
provided for the purpose of projecting a mask image onto the substrate in combination with lens 120.
This form of lithography is referred to herein as projection printing. As will be apparent from this
disclosure, proximity printing and the like may also be used according to some embodiments.

Light from the light source is permitted to reach only selected locations on the substrate as a
result of mask 128. Mask 128 may be, for example, a glass slide having
etched chrome thereon. The mask 128 in one embodiment is provided with a grid of transparent
locations and opaque locations. Such masks may be manufactured by, for example, Photo Sciences, Inc.
Light passes freely through the transparent regions of the mask, but is reflected from or absorbed by
other regions. Therefore, only selected regions of the substrate are exposed to light.

As discussed above, light valves (LCD's) may be used as an alternative to conventional masks to
selectively expose regions of the substrate. Alternatively, fiber optic faceplates such as those
available from Schott Glass, Inc., may be used for the purpose of contrast enhancement of the mask or
as the sole means of restricting the region to which light is applied. Such faceplates would be
placed directly above or on the substrate in the reactor shown in FIG. 8A. In still further
embodiments, flys-eye lenses, tapered fiber optic faceplates, or the like, may be used for contrast
enhancement.

In order to provide for illumination of regions smaller than a wavelength of light, more elaborate
techniques may be utilized. For example, according to one preferred embodiment, light is directed at
the substrate by way of molecular microcrystals on the tip of, for example, micropipettes. Such
devices are disclosed in Lieberman et al., "A Light Source Smaller Than the Optical Wavelength,"
Science (1990) 247:59-61, which is incorporated herein by reference for all purposes.

In operation, the substrate is placed on the cavity and sealed thereto. All operations in the process
of preparing the substrate are carried out in a room lit primarily or entirely by light of a
wavelength outside of the light range at which the protective group is removed. For example, in the
case of NVOC, the room should be lit with a conventional dark room light which provides little or no
UV light. All operations are preferably conducted at about room temperature.

A first, deprotection fluid (without a monomer) is circulated through the cavity. The solution
preferably is of 5 mM sulfuric acid in dioxane solution which serves to keep exposed amino groups
protonated and decreases their reactivity with photolysis by-products. Absorptive materials such as
N,N-diethylamino 2,4-dinitrobenzene, for example, may be included in the deprotection fluid which
serves to absorb light and prevent reflection and unwanted photolysis.

The slide is, thereafter, positioned in a light raypath from the mask such that first locations on
the substrate are illuminated and, therefore, deprotected. In preferred embodiments the substrate is
illuminated for between about 1 and 15 minutes with a preferred illumination time of about 10 minutes
at 10-20 mW/cm² with 365 nm light. The slides are neutralized (i.e., brought to a pH of about 7)
after photolysis with, for example, a solution of di-isopropylethylamine (DIEA) in methylene chloride
for about 5 minutes.

The first monomer is then placed at the first locations on the substrate. After irradiation, the
slide is removed, treated in bulk, and then reinstalled in the flow cell. Alternatively, a fluid
containing the first monomer, preferably also protected by a protective group, is circulated through
the cavity by way of pump 116. If, for example, it is desired to attach the amino acid Y to the
substrate at the first locations, the amino acid Y (bearing a protective group on its α-nitrogen),
along with reagents used to render the monomer reactive, and/or a carrier, is circulated from a
storage container 118, through the pump, through the cavity, and back to the inlet of the pump.

The monomer carrier solution is, in a preferred embodiment, formed by mixing of a first solution
(referred to herein as solution "A") and a second solution (referred to herein as solution "B").
Table 2 provides an illustration of a mixture which may be used for solution A.

TABLE 2

Representative Monomer Carrier Solution "A"
100 mg NVOC amino protected amino acid
[illegible] mg HOBT (1-Hydroxybenzotriazole)
230 μl DMF (Dimethylformamide)
86 μl DIEA (Diisopropylethylamine)

The composition of solution B is illustrated in Table 3. Solutions A and B are mixed and allowed to
react at room temperature for about 8 minutes, then diluted with 2 ml of DMF, and 500 μl are applied
to the surface of the slide or the solution is circulated through the reactor system and allowed to
react for about 2 hours at room temperature. The slide is then washed with DMF, methylene chloride
and ethanol.

TABLE 3

Representative Monomer Carrier Solution "B"
DMF
111 mg BOP (Benzotriazolyl-n-oxy-tris(dimethylamino) phosphoniumhexafluorophosphate)

As the solution containing the monomer to be attached is circulated through the cavity, the amino
acid or other monomer will react at its carboxy terminus with amino groups on the regions of the
substrate which have been deprotected. Of course, while the invention is illustrated by way of
circulation of the monomer through the cavity, the invention could be practiced by way of removing
the slide from the reactor and submersing it in an appropriate monomer solution.

After addition of the first monomer, the solution containing the first amino acid is then purged from
the system. After circulation of a sufficient amount of the DMF/methylene chloride such that removal
of the amino acid can be assured (e.g., about 50 times the volume of the cavity and carrier lines),
the mask or substrate is repositioned, or a new mask is utilized such that second regions on the
substrate will be exposed to light and the light source 124 is engaged for a second exposure. This
will deprotect second regions on the substrate and the process is repeated until the desired polymer
sequences have been synthesized.

The entire derivatized substrate is then exposed to a receptor of interest, preferably labeled with,
for example, a fluorescent marker, by circulation of a solution or suspension of the receptor through
the cavity or by contacting the surface of the slide in bulk. The receptor will preferentially bind
to certain regions of the substrate which contain complementary sequences.

Antibodies are typically suspended in what is commonly referred to as "supercocktail," which may be,
for example, a solution of about 1% BSA (bovine serum albumin), 0.5% Tween™ non-ionic detergent in
PBS (phosphate buffered saline) buffer. The antibodies are diluted into the supercocktail buffer to a
final concentration of, for example, about 0.1 to 4 μg/ml.

FIG. 8B illustrates an alternative preferred embodiment of the reactor shown in FIG. 8A. According to
this embodiment, the mask 128 is placed directly in contact with the substrate. Preferably, the
etched portion of the mask is placed face down so as to reduce the effects of light dispersion.
According to this embodiment, the imaging lenses 120 and 126 are not necessary because the mask is
brought into close proximity with the substrate.

For purposes of increasing the signal-to-noise ratio of the technique, some embodiments of the
invention provide for exposure of the substrate to a first labeled or unlabeled receptor followed by
exposure of a labeled, second receptor (e.g., an antibody) which binds at multiple sites on the first
receptor. If, for example, the first receptor is an antibody derived from a first species of an
animal, the second receptor is an antibody derived from a second species directed to epitopes
associated with the first species. In the case of a mouse antibody, for example, fluorescently
labeled goat antibody or antiserum which is antimouse may be used to bind at multiple sites on the
mouse antibody, providing several times the fluorescence compared to the attachment of a single mouse
antibody at each binding site. This process may be repeated again with additional antibodies (e.g.,
goat-mouse-goat, etc.) for further signal amplification.

In preferred embodiments an ordered sequence of masks is utilized. In some embodiments it is possible
to use as few as a single mask to synthesize all of the possible polymers of a given monomer set.

If, for example, it is desired to synthesize all 16 dinucleotides ...

... coupled; followed by a third mask, for the C column; and a final mask that exposes the right-most
column, for D. The first, second, third, and fourth masks may be a single mask translated to
different locations.

The process is repeated in the horizontal direction for the second unit of the dimer. This time, the
masks allow exposure of horizontal rows, again 0.25 cm wide. A, B, C, and D are sequentially coupled
using masks that expose horizontal fourths of the reaction area. The resulting substrate contains all
16 dinucleotides of four bases.

The eight masks used to synthesize the dinucleotide are related to one another by translation or
rotation. In fact, one mask can be used in all eight steps if it is suitably rotated and translated.
For example, in the example above, a mask with a single transparent region could be sequentially used
to expose each of the vertical columns, rotated 90°, and then sequentially used to allow exposure of
the horizontal rows.

Tables 4 and 5 provide a simple computer program in Quick Basic for planning a masking program and a
sample output, respectively, for the synthesis of a polymer chain of three monomers ("residues")
having three different monomers in the first level, four different monomers in the second level, and
five different monomers in the third level in a striped pattern. The output of the program is the
number of cells, the number of "stripes" (light regions) on each mask, and the amount of translation
required for each exposure of the mask.

TABLE 4 



Mask Strategy Program 



DEFINT A-Z 

DIM b(20), w(20X 1(500) 

FS ~ -LPTir 

OPEN fS FOR OUTPUT AS 0 I 
jmaa «* 3 "Number of residues 

b(l) « 3: bC2) «= 4: b(3) « 5 *N umber of eu2<fing blocks for res U.3 
g^l: ]max(l) = 1 

FORj- ITOjmax:g- ««b(j):NEXTj 
w(0)»Chw(l) = g/b(l) 

PRINT 01. "MASK2.BAS DATES, TIMES: PRINT #1. 
PRINT #t, USING -Number of residues = 0 jma* 
FOR j «= 1 TO jmax 

PRINT 01, USING " Residue 00 bttOdiog blocks"; j; b(j) 

NEXT j 
PRINT #1," 

PRINT 0U USING "Number oT cellx=###£"; & PRINT 0 1, 
FOR j « 2TOjmax 

lmaxG) « ImautO - I) • bfj — 1) 
w0) = w0» l)/KD 
NEXTj 

FOR j o ITOjmax 

PRINT 0\, USING "Mask for residue 00 m ; j: PRINT 0\, 
PRINT 01, USING ** Number of stripes- lmax(j) 
PRINT 01, USING " Width of each wipe w© 
FOR 1 « I TO lmixCD 
a « ! + (1 - 1) • wfj - 1) 
ae = a + wfj) — ] 

PRINT 01, USING " Stripe 00 begins at location 000 and ends at ###'*: 1; a: ae 
NEXT 1 
PRINT 0 1, 

PRINT 0 1. USING " For each of 00 bidding blocks, translate mask by 00 
cellO)"; bfjh wfj, 

PRINT 01, : PRINT 01, : PRINT #1, 
NEXTj 
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cleo tides from four bases, a 1 cm square synthesis region 

is divided conceptually into 16 boxes, each 0.25 cm TABLE 5 

wide. Denote the four monomer units by A, B, C and 



D. The first reactions are carried out in four vertical 65 Masking Strategy Ootpgt 

columns, each 0.25 cm wide, The first mask exposes the Nmnbcr 3 
left-most colnmn of b° xes > where A is coupled The 

second mask exposes the ne» column, where B is cou- Rsiitea s buOdin j bloda 
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TABLE 5-continued 
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Mjiikfog Strategy Outpur 



Number of cells= 60

Mask for residue 1
  Number of stripes= 1
  Width of each stripe= 20
  Stripe  1 begins at location   1 and ends at  20
  For each of 3 building blocks, translate mask by 20 cell(s)

Mask for residue 2
  Number of stripes= 3
  Width of each stripe= 5
  Stripe  1 begins at location   1 and ends at   5
  Stripe  2 begins at location  21 and ends at  25
  Stripe  3 begins at location  41 and ends at  45
  For each of 4 building blocks, translate mask by 5 cell(s)

Mask for residue 3
  Number of stripes= 12
  Width of each stripe= 1
  Stripe  1 begins at location   1 and ends at   1
  Stripe  2 begins at location   6 and ends at   6
  Stripe  3 begins at location  11 and ends at  11
  Stripe  4 begins at location  16 and ends at  16
  Stripe  5 begins at location  21 and ends at  21
  Stripe  6 begins at location  26 and ends at  26
  Stripe  7 begins at location  31 and ends at  31
  Stripe  8 begins at location  36 and ends at  36
  Stripe  9 begins at location  41 and ends at  41
  Stripe 10 begins at location  46 and ends at  46
  Stripe 11 begins at location  51 and ends at  51
  Stripe 12 begins at location  56 and ends at  56
  For each of 5 building blocks, translate mask by 1 cell(s)

© Copyright 1990, Affymax Research Institute

V. Details of One Embodiment of A Fluorescent Detection Device

FIG. 9 illustrates a fluorescent detection device for detecting fluorescently labeled receptors on a substrate. A substrate 112 is placed on an x/y translation table 202. In a preferred embodiment the x/y translation table is a model no. PM500-A1 manufactured by Newport Corporation. The x/y translation table is connected to and controlled by an appropriately programmed digital computer 204 which may be, for example, an appropriately programmed IBM PC/AT or AT compatible computer. Of course, other computer systems, special purpose hardware, or the like could readily be substituted for the AT computer used herein for illustration. Computer software for the translation and data collection functions described herein can be provided based on commercially available software including, for example, "Lab Windows" licensed by National Instruments, which is incorporated herein by reference for all purposes.

The substrate and x/y translation table are placed under a microscope 206 which includes one or more objectives 208. Light (about 488 nm) from a laser 210, which in some embodiments is a model no. 2020-05 argon ion laser manufactured by Spectraphysics, is directed at the substrate by a dichroic mirror 207 which passes greater than about 520 nm light but reflects 488 nm light. Dichroic mirror 207 may be, for example, a model no. FT510 manufactured by Carl Zeiss. Light reflected from the mirror then enters the microscope 206, which may be, for example, a model no. Axioscop 20 manufactured by Carl Zeiss. Fluorescein-marked materials on the substrate will fluoresce >488 nm light, and the fluoresced light will be collected by the microscope and passed through the mirror. The fluorescent light from the substrate is then directed through a wavelength filter 209 and, thereafter, through an aperture plate 211. Wavelength filter 209 may be, for example, a model no. OG530 manufactured by Melles Griot and aperture plate 211 may be, for example, a model no. 477352/477380 manufactured by Carl Zeiss.

The fluoresced light then enters a photomultiplier tube 212, which in some embodiments is a model no. R943-02 manufactured by Hamamatsu; the signal is amplified in preamplifier 214 and photons are counted by photon counter 216. The number of photons is recorded as a function of the location in the computer 204. Pre-Amp 214 may be, for example, a model no. SR440 manufactured by Stanford Research Systems and photon counter 216 may be a model no. SR400 manufactured by Stanford Research Systems. The substrate is then moved to a subsequent location and the process is repeated. In preferred embodiments the data are acquired every 1 to 100 µm with a data collection diameter of about 0.8 to 10 µm preferred. In embodiments with sufficiently high fluorescence, a CCD (charge coupled device) detector with broadfield illumination is utilized.

By counting the number of photons generated in a given area in response to the laser, it is possible to determine where fluorescent marked molecules are located on the substrate. Consequently, for a slide which has a matrix of polypeptides, for example, synthesized on the surface thereof, it is possible to determine which of the polypeptides is complementary to a fluorescently marked receptor.

According to preferred embodiments, the intensity and duration of the light applied to the substrate is controlled by varying the laser power and scan stage rate for improved signal-to-noise ratio by maximizing fluorescence emission and minimizing background noise.

While the detection apparatus has been illustrated primarily herein with regard to the detection of marked receptors, the invention will find application in other areas. For example, the detection apparatus disclosed herein could be used in the fields of catalysis, DNA or protein gel scanning, and the like.

VI. Determination of Relative Binding Strength of Receptors

The signal-to-noise ratio of the present invention is sufficiently high that not only can the presence or absence of a receptor on a ligand be detected, but also the relative binding affinity of receptors to a variety of sequences can be determined.

In practice it is found that a receptor will bind to several peptide sequences in an array, but will bind much more strongly to some sequences than others. Strong binding affinity will be evidenced herein by a strong fluorescent or radiographic signal since many receptor molecules will bind in a region of a strongly bound ligand. Conversely, a weak binding affinity will be evidenced by a weak fluorescent or radiographic signal due to the relatively small number of receptor molecules which bind in a particular region of a substrate having a ligand with a weak binding affinity for the receptor. Consequently, it becomes possible to determine relative binding avidity (or affinity in the case of univalent interactions) of a ligand herein by way of the intensity of a fluorescent or radiographic signal in a region containing that ligand.

Semiquantitative data on affinities might also be obtained by varying washing conditions and concentrations of the receptor. This would be done by comparison to known ligand receptor pairs, for example.
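The comparison of signal strengths described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `rank_relative_binding`, the ligand labels, and the count values are hypothetical and are not taken from the examples herein, and the background is assumed to be measured in a region bearing no ligand.

```python
def rank_relative_binding(counts, background):
    """Rank ligand regions by background-subtracted signal, strongest first.

    counts:     dict mapping a ligand label to the photon counts in its region
    background: counts measured in a region bearing no ligand (assumption)
    Returns a list of (ligand, signal relative to the strongest region).
    """
    if not counts:
        return []
    # Subtract background; a region at or below background is treated as zero.
    net = {ligand: max(c - background, 0) for ligand, c in counts.items()}
    strongest = max(net.values())
    ranked = sorted(net.items(), key=lambda item: item[1], reverse=True)
    if strongest == 0:
        return [(ligand, 0.0) for ligand, _ in ranked]
    return [(ligand, n / strongest) for ligand, n in ranked]


# Hypothetical counts chosen for illustration only.
example_counts = {"ligand 1": 150000, "ligand 2": 85000, "ligand 3": 21000}
ranking = rank_relative_binding(example_counts, background=20000)
```

Such a ranking is only semiquantitative, in keeping with the text: it orders ligands by signal but does not by itself yield binding constants.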
VII. Examples

The following examples are provided to illustrate the efficacy of the inventions herein. All operations were conducted at about ambient temperatures and pressures unless indicated to the contrary.

A. Slide Preparation

Before attachment of reactive groups it is preferred to clean the substrate which is, in a preferred embodiment, a glass substrate such as a microscope slide or cover slip. According to one embodiment the slide is soaked in an alkaline bath consisting of, for example, 1 liter of [illegible] for 12 hours. The slides are then washed under running water and allowed to air dry, and rinsed once with a solution of ethanol.

The slides are then aminated with, for example, aminopropyltriethoxysilane for the purpose of attaching amino groups to the glass surface on linker molecules, although any omega functionalized silane could also be used for this purpose. In one embodiment 0.1% aminopropyltriethoxysilane is utilized, although solutions with concentrations from 10^-7 % to 10% may be used, with about 10^-3 % to 2% preferred. A 0.1% mixture is prepared by adding to 100 ml of a 95% ethanol/5% water mixture, 100 microliters (µl) of aminopropyltriethoxysilane. The mixture is agitated at about ambient temperature on a rotary shaker for about 5 minutes. 500 µl of this mixture is then applied to the surface of one side of each cleaned slide. After 4 minutes, the slides are decanted of this solution and rinsed three times by dipping in, for example, 100% ethanol. After the plates dry, they are placed in a 110°-120° C. vacuum oven for about 20 minutes, and then allowed to cure at room temperature for about 12 hours in an argon environment. The slides are then dipped into DMF (dimethylformamide) solution, followed by a thorough washing with methylene chloride.

The aminated surface of the slide is then exposed to about 500 µl of, for example, a 30 millimolar (mM) solution of NVOC-GABA (gamma amino butyric acid) [illegible] in DMF for attachment of a NVOC-GABA to each of the amino groups. The surface is washed with, for example, DMF, methylene chloride, and ethanol.

Any unreacted aminopropyl silane on the surface (that is, those amino groups which have not had the NVOC-GABA attached) is now capped with acetyl groups (to prevent further reaction) by exposure to a 1:3 mixture of acetic anhydride in pyridine for 1 hour. Other materials which may perform this residual capping function include trifluoroacetic anhydride, formicacetic anhydride, or other reactive acylating agents. Finally, the slides are washed again with DMF, methylene chloride, and ethanol.

B. Synthesis of Eight Trimers of "A" and "B"

FIG. 10 illustrates a possible synthesis of the eight trimers of the two-monomer set: gly, phe (represented by "A" and "B," respectively). A glass slide bearing silane groups terminating in 6-nitroveratryloxycarboxamide (NVOC-NH) residues is prepared as a substrate. Active esters (pentafluorophenyl, OBt, etc.) of gly and phe protected at the amino group with NVOC are prepared as reagents. While not pertinent to this example, if side chain protecting groups are required for the monomer set, these must not be photoreactive at the wavelength of light used to protect the primary chain.

For a monomer set of size n, n × l cycles are required to synthesize all possible sequences of length l. A cycle consists of:

1. Irradiation through an appropriate mask to expose the amino groups at the sites where the next residue is to be added, with appropriate washes to remove the by-products of the deprotection.

2. Addition of a single activated and protected (with the same photochemically-removable group) monomer, which will react only at the sites addressed in step 1, with appropriate washes to remove the excess reagent from the surface.

The above cycle is repeated for each member of the monomer set until each location on the surface has been extended by one residue in one embodiment. In other embodiments, several residues are sequentially added at one location before moving on to the next location. Cycle times will generally be limited by the coupling reaction rate, now as short as 20 min in automated peptide synthesizers. This step is optionally followed by addition of a protecting group to stabilize the array for later testing. For some types of polymers (e.g., peptides), a final deprotection of the entire surface (removal of photoprotective side chain groups) may be required.

More particularly, as shown in FIG. 10A, the glass 20 is provided with regions 22, 24, 26, 28, 30, 32, 34, and 36. Regions 30, 32, 34, and 36 are masked, as shown in FIG. 10B, and the glass is irradiated and exposed to a reagent containing "A" (e.g., gly), with the resulting structure shown in FIG. 10C. Thereafter, regions 22, 24, 26, and 28 are masked, the glass is irradiated (as shown in FIG. 10D) and exposed to a reagent containing "B" (e.g., phe), with the resulting structure shown in FIG. 10E. The process proceeds, consecutively masking and exposing the sections as shown until the structure shown in FIG. 10M is obtained. The glass is irradiated and the terminal groups are, optionally, capped by acetylation. As shown, all possible trimers of gly/phe are obtained.

In this example, no side chain protective group removal is necessary. If it is desired, side chain deprotection may be accomplished by treatment with ethanedithiol and trifluoroacetic acid.

In general, the number of steps needed to obtain a particular polymer chain is defined by:

$n \times l$    (1)

where:
n = the number of monomers in the basis set of monomers, and
l = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of length l will be:

$n^{l}$    (2)

Of course, greater diversity is obtained by using masking strategies which will also include the synthesis of polymers having a length of less than l. If, in the extreme case, all polymers having a length less than or equal to l are synthesized, the number of polymers synthesized will be:

$n^{l} + n^{l-1} + \cdots + n^{1}$    (3)

The maximum number of lithographic steps needed will generally be n for each "layer" of monomers, i.e., the total number of masks (and, therefore, the number of lithographic steps) needed will be n × l. The size of the transparent mask regions will vary in accordance with the area of the substrate available for synthesis and the number of sequences to be formed. In general, the size of the synthesis areas will be:

size of synthesis areas = (A)/(Sequences)

where:
A is the total area available for synthesis; and
Sequences is the number of sequences desired in the area.
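The counting relations above lend themselves to a short worked example. The following Python sketch is an illustration only; the function names are hypothetical, and the values used are those of the two-monomer trimer example (n = 2, l = 3) and of a 1 cm square region divided among 16 sequences.

```python
from itertools import product


def cycles_required(n, l):
    """Expression (1): masking/coupling cycles for all sequences of length l."""
    return n * l


def sequences_of_length(n, l):
    """Expression (2): number of distinct sequences of length l."""
    return n ** l


def sequences_up_to_length(n, l):
    """Expression (3): n^l + n^(l-1) + ... + n^1."""
    return sum(n ** k for k in range(1, l + 1))


def synthesis_area(total_area, sequences):
    """Size of each synthesis area = A / Sequences."""
    return total_area / sequences


# Two monomers ("A" = gly, "B" = phe), chains of three units.
n, l = 2, 3
trimers = ["".join(p) for p in product("AB", repeat=l)]

# A 1 cm^2 region shared by 16 sequences gives 0.0625 cm^2 each,
# i.e. boxes 0.25 cm on a side.
area_per_sequence = synthesis_area(1.0, 16)
```

For n = 2 and l = 3 this gives 6 cycles, 8 trimers, and 14 sequences when all shorter lengths are included.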

It will be appreciated by those of skill in the art that the above method could readily be used to simultaneously produce thousands or millions of oligomers on a substrate using the photolithographic techniques disclosed herein. Consequently, the method results in the ability to practically test large numbers of, for example, di, tri, tetra, penta, hexa, hepta, octapeptides, dodecapeptides, or larger polypeptides (or correspondingly, polynucleotides).

The above example has illustrated the method by way of a manual example. It will of course be appreciated that automated or semi-automated methods could be used. The substrate would be mounted in a flow cell for automated addition and removal of reagents, to minimize the volume of reagents needed, and to more carefully control reaction conditions. Successive masks could be applied manually or automatically.
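Where successive masks are applied automatically, a striped layout of the kind computed by the Quick Basic program of Table 4 could be planned in software. The following is a minimal Python sketch of that calculation, not a transcription of the program; the function name `plan_masks` and the dictionary keys are hypothetical, and it assumes, as in Table 4, that cells are numbered from 1 along a single axis.

```python
def plan_masks(blocks):
    """Plan a striped mask for each residue level.

    blocks: number of building blocks at each level, e.g. [3, 4, 5].
    Returns (total number of cells, list of per-level plans).
    """
    cells = 1
    for b in blocks:
        cells *= b

    plans = []
    stripes = 1    # light regions on the mask for the current level
    pitch = cells  # distance between stripe starts (stripe width one level up)
    for b in blocks:
        width = pitch // b
        starts = [1 + k * pitch for k in range(stripes)]
        plans.append({
            "stripes": stripes,
            "width": width,
            "spans": [(s, s + width - 1) for s in starts],
            "exposures": b,         # one exposure per building block
            "translate_by": width,  # cells to shift the mask between exposures
        })
        stripes *= b
        pitch = width
    return cells, plans


cells, plans = plan_masks([3, 4, 5])
```

With three, four, and five building blocks at the three levels, the sketch gives 60 cells and masks of 1, 3, and 12 stripes, in agreement with the sample output of Table 5.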

C. Synthesis of a Dimer of an Aminopropyl Group and a Fluorescent Group

In synthesizing the dimer of an aminopropyl group and a fluorescent group, a functionalized durapore membrane was used as a substrate. The durapore membrane was a polyvinylidene difluoride with aminopropyl groups. The aminopropyl groups were protected with the DDZ group by reaction of the carbonyl chloride with the amino groups, a reaction readily known to those of skill in the art. The surface bearing these groups was placed in a solution of THF and contacted with a mask bearing a checkerboard pattern of 1 mm opaque and transparent regions. The mask was exposed

pregnated with a known number of fluorescein molecules.

One of the beads was placed in the illumination field on the scan stage as shown in FIG. 9 in a field of a laser spot which was initially shuttered. After being positioned in the illumination field, the photon detection equipment was turned on. The laser beam was unblocked and it interacted with the particle bead, which then fluoresced. Fluorescence curves of beads impregnated with 7,000 and 13,000 fluorescein molecules are shown in FIGS. 11A and 11B, respectively. On each curve, traces for beads without fluorescein molecules are also shown. These experiments were performed with 488 nm excitation, with 100 µW of laser power. The light was focused through a 40 power 0.75 NA objective.

The fluorescence intensity in all cases started off at a high value and then decreased exponentially. The fall-off in intensity is due to photobleaching of the fluorescein molecules. The traces of beads without fluorescein molecules are used for background subtraction. The difference in the initial exponential decay between labeled and nonlabeled beads is integrated to give the total number of photon counts, and this number is related to the number of molecules per bead. Therefore, it is possible to deduce the number of photons per fluorescein molecule that can be detected. For the curves illustrated in FIGS. 11A and 11B, this calculation indicates that about 40 to 50 photons per fluorescein molecule are detected.
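The background subtraction and integration described above can be sketched as follows. This Python fragment is an illustration under stated assumptions: the function name is hypothetical, the traces are assumed to be sampled at the same time points, and the count values are invented for the example and are not the data of FIGS. 11A and 11B.

```python
def photons_per_molecule(labeled_trace, blank_trace, molecules_per_bead):
    """Estimate detected photons per fluorescein molecule.

    labeled_trace:      counts per time bin for a bead bearing fluorescein
    blank_trace:        counts per time bin for a bead without fluorescein
    molecules_per_bead: known number of fluorescein molecules on the bead
    """
    if len(labeled_trace) != len(blank_trace):
        raise ValueError("traces must cover the same time bins")
    # Integrate the difference between the labeled and unlabeled decay curves.
    net_counts = sum(s - b for s, b in zip(labeled_trace, blank_trace))
    return net_counts / molecules_per_bead


# Hypothetical, exponentially decaying traces (counts per bin).
labeled = [200000, 120000, 75000, 50000]
blank = [40000, 35000, 30000, 25000]
estimate = photons_per_molecule(labeled, blank, molecules_per_bead=7000)
```

With these made-up traces the net integrated signal is 315,000 counts, or 45 photons per molecule for a bead bearing 7,000 molecules.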

to ultraviolet light having a wavelength down to at least about 280 nm for about 5 minutes at ambient temperature, although a wide range of exposure times and temperatures may be appropriate in various embodiments of the invention. For example, in one embodiment, an exposure time of between about 1 and 5000 seconds may be used at process temperatures of between -70° and +50° C.

In one preferred embodiment, exposure times of between about 1 and 500 seconds at about ambient pressure are used. In some preferred embodiments, pressure above ambient is used to prevent evaporation.

The surface of the membrane was then washed for about 1 hour with a fluorescent label which included an active ester bound to a chelate of a lanthanide. Wash times will vary over a wide range of values from about a few minutes to a few hours. These materials fluoresce in the red and the green visible region. After the reaction with the active ester in the fluorophore was complete, the locations in which the fluorophore was bound could be visualized by exposing them to ultraviolet light

E. Determination of the Number of Molecules Per Unit Area

Aminopropylated glass microscope slides prepared according to the methods discussed above were utilized in order to establish the density of labeling of the slides. The free amino termini of the slides were reacted with FITC (fluorescein isothiocyanate) which forms a covalent linkage with the amino group. The slide is then scanned to count the number of fluorescent photons generated in a region which, using the estimated 40-50 photons per fluorescent molecule, enables the calculation of the number of molecules which are on the surface per unit area.
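The conversion from photon counts to surface density is simple arithmetic, sketched below in Python. The function names and the numerical inputs are hypothetical; only the figure of roughly 40 to 50 photons per molecule comes from the bead calibration described in the text.

```python
def molecules_per_unit_area(net_counts, photons_per_molecule, area):
    """Surface density = (counts / photons per molecule) / scanned area."""
    return net_counts / photons_per_molecule / area


def mean_spacing(density):
    """Side of the square occupied by one molecule, in the length unit of 'area'."""
    return density ** -0.5


# Hypothetical scan: 45,000 net counts from a 10-unit area at 45 photons/molecule.
density = molecules_per_unit_area(45000.0, 45.0, 10.0)

# A density of 0.25 molecules per nm^2 corresponds to one molecule per 2 x 2 nm.
spacing_nm = mean_spacing(0.25)
```

Expressing the result as a spacing makes it directly comparable with labeling densities quoted as one fluorescein per square of a given side.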

A slide with aminopropyl silane on its surface was immersed in a 1 mM solution of FITC in DMF for 1 hour at about ambient temperature. After reaction, the slide was washed twice with DMF and then washed with ethanol, water, and then ethanol again. It was then dried and stored in the dark until it was ready to be examined.

Through the use of curves similar to those shown in FIGS. 11A and 11B, and by integrating the fluorescent counts under the exponentially decaying signal, the number of free amino groups on the surface after derivatization was determined. It was determined that slides with labeling densities of 1 fluorescein per 10^3 × 10^3 to ~2 × 2 nm could be reproducibly made as the concentration of aminopropyltriethoxysilane varied from



10^-5 % to 10^-1 %.

and observing the red and the green fluorescence. It was observed that the derivatized regions of the substrate closely corresponded to the original pattern of the mask.

D. Demonstration of Signal Capability

Signal detection capability was demonstrated using a low-level standard fluorescent bead kit manufactured by Flow Cytometry Standards and having model no. 824. This kit includes 5.8 µm diameter beads, each im-

F. Removal of NVOC and Attachment of A Fluorescent Marker

NVOC-GABA groups were attached as described above. The entire surface of one slide was exposed to light so as to expose a free amino group at the end of the gamma amino butyric acid. This slide, and a duplicate which was not exposed, were then exposed to fluorescein isothiocyanate (FITC).

FIG. 12A illustrates the slide which was not exposed to light, but which was exposed to FITC. The units of the x axis are time and the units of the y axis are counts. The trace contains a certain amount of background fluorescence. The duplicate slide was exposed to 350 nm broadband illumination for about 1 minute (12 mW/cm², ~350 nm illumination), washed and reacted with FITC. The fluorescence curves for this slide are shown in FIG. 12B. A large increase in the level of fluorescence is observed, which indicates photolysis has exposed a number of amino groups on the surface of the slides for attachment of a fluorescent marker.

G. Use of a Mask in Removal of NVOC

The next experiment was performed with a 0.1% aminopropylated slide. Light from a Hg-Xe arc lamp was imaged onto the substrate through a laser-ablated chrome-on-glass mask in direct contact with the substrate.

This slide was illuminated for approximately 5 minutes, with 12 mW of 350 nm broadband light and then reacted with the 1 mM FITC solution. It was put on the laser detection scanning stage and a graph was plotted as a two-dimensional representation of position color-coded for fluorescence intensity. The fluorescence intensity (in counts) as a function of location is given on the color scale to the right of FIG. 13A for a mask of 100 × 100 µm.

The experiment was repeated a number of times through various masks. The fluorescence pattern for a 50 µm mask is illustrated in FIG. 13B, for a 20 µm mask in FIG. 13C, and for a 10 µm mask in FIG. 13D. The mask pattern is distinct down to at least about 10 µm squares using this lithographic technique.

H. Attachment of YGGFL and Subsequent Exposure to Herz Antibody and Goat Antimouse

In order to establish that receptors to a particular polypeptide sequence would bind to a surface-bound peptide and be detected, Leu enkephalin was coupled to the surface and recognized by an antibody. A slide was derivatized with 0.1% aminopropyltriethoxysilane and protected with NVOC. A 500 µm checkerboard mask was used to expose the slide in a flow cell using backside contact printing. The Leu enkephalin sequence (H2N-tyrosine, glycine, glycine, phenylalanine, leucine-CO2H, otherwise referred to herein as YGGFL) was attached via its carboxy end to the exposed amino groups on the surface of the slide. The peptide was added in DMF solution with the BOP/HOBT/DIEA coupling reagents and recirculated through the flow cell for 2 hours at room temperature.

A first antibody, known as the Herz antibody, was applied to the surface of the slide for 45 minutes at 2 µg/ml in a supercocktail (containing 1% BSA and 1% ovalbumin also in this case). A second antibody, goat anti-mouse fluorescein conjugate, was then added at 2 µg/ml in the supercocktail buffer, and allowed to incubate for 2 hours. An image taken at 10 µm steps indicated that not only can deprotection be carried out in a well defined pattern, but also that (1) the method provides for successful coupling of peptides to the surface of the substrate, (2) the surface of a bound peptide is available for binding with an antibody, and (3) the detection apparatus capabilities are sufficient to detect binding of a receptor.

I. Monomer-by-Monomer Formation of YGGFL and Subsequent Exposure to Labeled Antibody

Monomer-by-monomer synthesis of YGGFL and GGFL in alternate squares was performed on a slide in a checkerboard pattern and the resulting slide was exposed to the Herz antibody. This experiment and the results thereof are illustrated in FIGS. 14A, 14B, 15A, and 15B.

In FIG. 14A, a slide is shown which is derivatized with the aminopropyl group, protected in this case with t-BOC (t-butoxycarbonyl). The slide was treated with TFA to remove the t-BOC protecting group. ε-aminocaproic acid, which was t-BOC protected at its amino group, was then coupled onto the aminopropyl groups. The aminocaproic acid serves as a spacer between the aminopropyl group and the peptide to be synthesized. The amino end of the spacer was deprotected and coupled to NVOC-leucine. The entire slide was then illuminated with 12 mW of 325 nm broadband illumination. The slide was then coupled with NVOC-phenylalanine and washed. The entire slide was again illuminated, then coupled to NVOC-glycine and washed. The slide was again illuminated and coupled to NVOC-glycine to form the sequence shown in the last portion of FIG. 14A.

As shown in FIG. 14B, alternating regions of the slide were then illuminated using a projection print using a 500 × 500 µm checkerboard mask; thus, the amino group of glycine was exposed only in the lighted areas. When the next coupling chemistry step was carried out, NVOC-tyrosine was added, and it coupled only at those spots which had received illumination. The entire slide was then illuminated to remove all the NVOC groups, leaving a checkerboard of YGGFL in the lighted areas and in the other areas, GGFL. The Herz antibody (which recognizes the YGGFL, but not GGFL) was then added, followed by goat anti-mouse fluorescein conjugate.

The resulting fluorescence scan is shown in FIG. 15A, and the color coding for the fluorescence intensity is again given on the right. Dark areas contain the tetrapeptide GGFL, which is not recognized by the Herz antibody (and thus there is no binding of the goat anti-mouse antibody with fluorescein conjugate), and in the red areas YGGFL is present. The YGGFL pentapeptide is recognized by the Herz antibody and, therefore, there is antibody in the lighted regions for the fluorescein-conjugated goat anti-mouse to recognize.

Similar patterns are shown for a 50 µm mask used in direct contact ("proximity print") with the substrate in FIG. 15B. Note that the pattern is more distinct and the corners of the checkerboard pattern are touching when the mask is placed in direct contact with the substrate (which reflects the increase in resolution using this technique).

J. Monomer-by-Monomer Synthesis of YGGFL and PGGFL

A synthesis using a 50 µm checkerboard mask similar to that shown in FIG. 15B was conducted. However, P was added to the GGFL sites on the substrate through an additional coupling step. P was added by exposing protected GGFL to light and subsequent exposure to P in the manner set forth above. Therefore, half of the regions on the substrate contained YGGFL and the remaining half contained PGGFL.

The fluorescence plot for this experiment is provided in FIG. 16. As shown, the regions are again readily discernable. This experiment demonstrates that antibodies are able to recognize a specific sequence and that the recognition is not length-dependent.
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^Monomer-by-Monomer Synthesis of YGGFL and TABLE Utinued 

In order to further demonstrate the operability of the invention, a 50 µm checkerboard pattern of alternating YGGFL and YPGGFL was synthesized on a substrate using techniques like those set forth above. The resulting fluorescence plot is provided in FIG. 17. Again, it is seen that the antibody is clearly able to recognize the YGGFL sequence and does not bind significantly at the YPGGFL regions.

L. Synthesis of an Array of Sixteen Different Amino Acid Sequences

An array of 16 different amino acid sequences (replicated four times) was synthesized on each of two glass substrates. The sequences were synthesized by attaching the sequence NVOC-GFL across the entire surface of the slides. Using a series of masks, two layers of amino acids were then selectively applied to the substrate. Each region had dimensions of 0.25 cm × 0.0625 cm. The first slide contained amino acid sequences containing only L amino acids while the second slide contained selected D amino acids. FIGS. 18A and 18B illustrate a map of the various regions on the first and second slides, respectively. The patterns shown in FIGS. 18A and 18B were duplicated four times on each slide. The slides were then exposed to the Herz antibody and fluorescein-labeled goat anti-mouse.

FIG. 19 is a fluorescence plot of the first slide, which contained only L amino acids. Red indicates strong binding (149,000 counts or more) while black indicates little or no binding of the Herz antibody (20,000 counts or less). The bottom right-hand portion of the slide appears "cut off" because the slide was broken during processing. The sequence YGGFL is clearly most strongly recognized. The sequences YAGFL and YSGFL also exhibit strong recognition of the antibody. By contrast, most of the remaining sequences show little or no binding. The four duplicate portions of the slide are extremely consistent in the amount of binding

Apparent Binding to Herz Ab
L-a.a. Set    D-a.a. Set
waGFL

VIII. Illustrative Alternative Embodiment

According to an alternative embodiment of the invention, the methods provide for attaching to the surface a caged binding member which in its caged form has a relatively low affinity for potentially binding [illegible]

According to this alternative embodiment, the invention provides methods for forming predefined regions on a surface of a solid support, wherein the predefined regions are capable of immobilizing receptors. The methods make use of caged binding members attached to the surface to enable selective activation of the predefined regions. The caged binding members are liberated to act as binding members ultimately capable of binding receptors upon selective activation of the predefined regions. The activated binding members are then used to immobilize specific molecules such as receptors on the predefined region of the surface. The above procedure is repeated at the same or different sites on the surface so as to provide a surface prepared with a plurality of regions on the surface containing, for example, the same or different receptors. When receptors immobilized in this way have a differential affinity for one or more ligands, screenings and assays for the ligands can be conducted in the regions of the surface containing the receptors.

The alternative embodiment may make use of novel caged binding members attached to the substrate. Caged (unactivated) members have a relatively low affinity for receptors of substances that specifically bind to uncaged binding members when compared with the corresponding affinities of activated binding members. Thus, the binding members are protected from reaction until a suitable source of energy is applied to the regions
shown therein. of the surface desired to be activated. Upon application 

FIG. 20 is a fluorescence plot of the second slide. of a suit ablc energy source, the caging groups labilize, 
Again, strongest binding is exhibited by the YGGFL 43 ^"^y presenting the activated binding member. A 
sequence. Significant binding is also detected to typical energy source will be light. 
YaGFL, YsGFL, and YpGFL (where L-amino acids 0x106 Ac bmdin S members on the surface are acti- 
are identified by one upper case letter abbreviation and V f Ud they may attached 10 a receptor. The receptor 
D-amino acids are identified by one lower case letter chosen ma y a monoclonal antibody, a nucleic acid 
abbreviation). The remaining sequences show less bind- 50 *$ aencc > a receptor, etc The receptor will usu- 
ing with the antibody. Note the low binding efficiency *' ? ot f»*y*> Prepared so as to permit 

of the sequence yGGFL. attaching it, directly or indirectly, to a binding member. 

Table 6 fists the various sequences tested in order of 

example, a specific binding substance having a 

relative fluorescence, which provides information re- T* 8 ^ S ° T hmdili & member and a 

garding relative binding affinity 53 005 affinity for the receptor or a conjugate of the 

receptor may be used to act as a bridge between binding 
TABLE 6 members and receptors if desired The method uses a 

receptor prepared such that the receptor retains its 
activity toward a particular ligand. 

60 Preferably, the caged binding member attached to the 

solid substrate will be a photoactivatable biotin com- 
plex, ie., a biotin molecule that has been chemically 
modified with photc*ctrvatable protecting groups so 
that it has a significantly reduced binding affinity for 
65 avidin or avid in analogs than does natural biotin. In a 
preferred embodiment, the protecting groups localized 
in a predefined region of the surface will be removed 
upon application of a suitable source of radiation to give 



Apparent Bindia* to Ah 
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VAGFL 
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binding members, that are biotin or a functional] y analo- 
gous compound having substantially the «*mf binding 
affinity for avidin or avidin analogs as does biotin. 

In another preferred embodiment, avidin or an avidin
analog is incubated with activated binding members on
the surface until the avidin binds strongly to the binding
members. The avidin so immobilized on predefined
regions of the surface can then be incubated with a
desired receptor or conjugate of a desired receptor. The
receptor will preferably be biotinylated, e.g., a
biotinylated antibody, when avidin is immobilized on the
predefined regions of the surface. Alternatively, a
preferred embodiment will present an avidin/biotinylated
receptor complex, which has been previously prepared,
to activated binding members on the surface.

IX. Conclusion

The present inventions provide greatly improved
methods and apparatus for synthesis of polymers on
substrates. It is to be understood that the above
description is intended to be illustrative and not restrictive.
Many embodiments will be apparent to those of skill in
the art upon reviewing the above description. By way
of example, the invention has been described primarily
with reference to the use of photoremovable protective
groups, but it will be readily recognized by those of skill
in the art that sources of radiation other than light could
also be used. For example, in some embodiments it may
be desirable to use protective groups which are
sensitive to electron beam irradiation, x-ray irradiation, in
combination with electron beam lithograph, or x-ray
lithography techniques. Alternatively, the group could
be removed by exposure to an electric current. The
scope of the invention should, therefore, be determined
not with reference to the above description, but should
instead be determined with reference to the appended
claims, along with the full scope of equivalents to which
such claims are entitled.

What is claimed is:

1. A substrate with a surface comprising 10³ or more
groups of oligonucleotides with different, known
sequences covalently attached to the surface in discrete
known regions, said 10³ or more groups of
oligonucleotides occupying a total area of less than 1 cm² on said
substrate, said groups of oligonucleotides having
different nucleotide sequences.

2. The substrate as recited in claim 1 wherein said
substrate comprises 10⁴ or more different groups of
oligonucleotides with known sequences covalently
coupled to discrete known regions of said substrate.

3. The substrate as recited in claim 1 wherein said
substrate comprises 10⁵ or more different groups of
oligonucleotides with known sequences in discrete
known regions.

4. The substrate as recited in claim 1 wherein said
substrate comprises 10⁶ or more different groups of
oligonucleotides with known sequences in discrete
known regions.

5. The substrate as recited in claim 1 wherein said
groups of oligonucleotides are at least 50% pure within
said discrete known regions.

6. The substrate as recited in claim 1 wherein the
groups of oligonucleotides are attached to the surface
by a linker.

7. An array of more than 1,000 different groups of
oligonucleotide molecules with known sequences
covalently coupled to a surface of a substrate, said groups of
oligonucleotide molecules each in discrete known
regions and differing from other groups of
oligonucleotide molecules in monomer sequence, each of said
discrete known regions being an area of less than about
0.01 cm² and each discrete known region comprising
oligonucleotides of known sequence, said different
groups occupying a total area of less than 1 cm².

8. The array as recited in claim 7 wherein said area is
less than 10,000 microns².

9. The array as recited in claim 7 made by the process 
of: 

exposing a first region of said substrate to light to 
remove photoremovable groups from nucleic acids 
in said first region, and not exposing a second re- 
gion of said surface to light; 

covalently coupling a first nucleotide to said nucleic 
acids on said part of said substrate exposed to light, 
said first nucleotide covalently coupled to said 
photoremovable group; 

exposing a part of said first region of said substrate to 
light, and not exposing another part of said first 
region of said substrate to light to remove said 
photoremovable groups; 

covalently coupling a second nucleotide to said part 
of said first region exposed to light; and 

repeating said steps of exposing said substrate to light 
and covalently coupling nucleotides until said 
more than 500 different groups of nucleotides are 
formed on said surface. 

10. The array as recited in claim 7 comprising more 
than 10,000 groups of oligonucleotides of known se- 
quences. 
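The exposure and coupling cycle recited in claim 9 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `synthesize`, the integer region indices, and the example masks are hypothetical and do not come from the specification. Each cycle exposes a set of regions to light and then couples one nucleotide to the exposed regions only.

```python
from typing import List, Sequence, Set, Tuple

# One cycle: (set of region indices exposed to light, nucleotide coupled).
# Both the representation and the names are illustrative assumptions.
Step = Tuple[Set[int], str]


def synthesize(n_regions: int, steps: Sequence[Step]) -> List[str]:
    """Return the sequence built in each region after all cycles."""
    regions = [""] * n_regions
    for exposed, nucleotide in steps:
        for i in exposed:
            if not 0 <= i < n_regions:
                raise IndexError(f"region {i} is outside the array")
            # Light removed the photoremovable group in this region, so the
            # incoming nucleotide couples here; unexposed regions are unchanged.
            regions[i] += nucleotide
    return regions


# Four cycles with hypothetical masks over a four-region array.
steps = [({0, 1}, "A"), ({2, 3}, "C"), ({0, 2}, "G"), ({1, 3}, "T")]
array = synthesize(4, steps)
print(array)
```

In this toy run, four masked cycles yield four different dinucleotides, one per region, which is the sense in which repeated exposure and coupling steps build up many distinct groups in known locations.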






United States Patent [19]

Lockhart et al. 



[54] SURFACE-BOUND, UNIMOLECULAR, 
DOUBLE-STRANDED DNA 

[75] Inventors: David J. Lockhart, Santa Clara, Calif.;
Dirk Vetter, Freiburg, Germany;
Martin Diggelmann, Niederdorf,
Switzerland

[73] Assignee: Affymetrix, Inc., Santa Clara, Calif.

[21] Appl. No.: 327,687
[22] Filed: Oct. 24, 1994

[51] Int. Cl.⁶ C12Q 1/68; C07H 21/00

[52] U.S. Cl. 435/6; 536/23.1

[58] Field of Search 435/6; 536/23.1;
530/413

[56] References Cited 

U.S. PATENT DOCUMENTS 

O76.U0 VI 9 S3 David et iL 4 3V5 

4,562,157 12/1985 Lowe et al. .... 435/287.2 

4,728,502 3/1988 Hamill 422/116

5,143,854 9/1992 Pirrung et al. 436/518

5,288,514 2/1994 Ellaun 165/155 

FOREIGN PATENT DOCUMENTS 

WO89/10977 11/1989 WIPO.

WO89/11548 11/1989 WIPO.

WO90/00626 1/1990 WIPO.

WO90/15070 12/1990 WIPO.

WO92/00091 1/1992 WIPO.

OTHER PUBLICATIONS 

Duncan, C. H. et al. (1988) Analytical Biochemistry 169:
104-108. "Affinity Chromatography of a Sequence-specific
DNA binding protein using Teflon linked ...".

Ma, M. Y.-X. et al. (1993) Biochemistry 32: 1751-1758.
"Design & Synthesis of RNA Miniduplicates via a synthetic
linker approach."

Markiewicz, W. T. et al. (1989) Nucleic
Acids Research 17: 7149-7157. "Universal solid supports
for the synthesis of oligonucleotides with 3'-PO₄s".



US005556752A
[11] Patent Number: 5,556,752
[45] Date of Patent: Sep. 17, 1996



Ohlmeyer, M. H. J. et al. (1993) Proc. Natl. Acad. Sci. USA 90:
10922-10926. "Complex Synthetic Chemical Libraries
Indexed with molecular Tags".
Geysen et al., J. Immun. Meth., 102:259-274 (1987).
Frank and Doring, Tetrahedron, 44:6031-6040 (1988).
Fodor et al., Science, 251:767-777 (1991).
Lam et al., Nature, 354:82-84 (1991).
Houghten et al., Nature, 354:84-86 (1991).
Galas et al., Nucleic Acid Res., 5(9):3157-3170 (1978).
Murphy et al., Science, 262:1025-1029 (1993).
Lysov et al., Dokl. Akad. Nauk SSSR, 303:1508-1511 (1988)
(See footnote provided, P. 436).
Bains et al., J. Theor. Biol., 135:303-307 (1988).
Drmanac et al., Genomics, 4:114-128 (1989).
Strezoska et al., Proc. Natl. Acad. Sci. USA,
88:10089-10093 (1991).
Drmanac et al., Science, 260:1649-1652 (1993).
Needels et al., Proc. Natl. Acad. Sci. USA,
90:10700-10704 (1993).
Scaria, P. V. et al., J. of Biol. Chem., 266(9):5417-5423
(1993).

(List continued on next page.)

Primary Examiner: Mindy Fleisher
Assistant Examiner: Scott David Priebe
Attorney, Agent, or Firm: Townsend and Townsend and
Crew LLP

[57]



ABSTRACT 



Libraries of unimolecular, double-stranded oligonucleotides
on a solid support. These libraries are useful in
pharmaceutical discovery for the screening of numerous biological
samples for specific interactions between the
double-stranded oligonucleotides, and peptides, proteins, drugs and
RNA. In a related aspect, the present invention provides
libraries of conformationally restricted probes on a solid
support. The probes are restricted in their movement and
flexibility using double-stranded oligonucleotides as
scaffolding. The probes are also useful in various screening
procedures associated with drug discovery and diagnosis.
The present invention further provides methods for the
preparation and screening of the above libraries.

6 Claims, 1 Drawing Sheet 
SURFACE-BOUND, UNIMOLECULAR,
DOUBLE-STRANDED DNA 

GOVERNMENT RIGHTS 

Research leading to the invention was funded in part by
NIH Grant No. R01HG00813-03 and the government may
have certain rights to the invention.

BACKGROUND OF THE INVENTION 

The present invention relates to the field of polymer
synthesis and the use of polymer libraries for biological
screening. More specifically, in one embodiment the
invention provides arrays of diverse double-stranded
oligonucleotide sequences. In another embodiment, the invention
provides arrays of conformationally restricted probes, wherein
the probes are held in position using double-stranded DNA
sequences as scaffolding. Libraries of diverse unimolecular
double-stranded nucleic acid sequences and probes may be
used, for example, in screening studies for determination of
binding affinity exhibited by binding proteins, drugs, or
RNA.



Methods of synthesizing desired single stranded DNA
sequences are well known to those of skill in the art. In
particular, methods of synthesizing oligonucleotides are
found in, for example, Oligonucleotide Synthesis: A
Practical Approach, Gait, ed., IRL Press, Oxford (1984),
incorporated herein by reference in its entirety for all purposes.
Synthesizing unimolecular double-stranded DNA in solution
has also been described. See, Durand, et al. Nucleic Acids
Res. 18:6355-6359 (1990) and Thomson, et al. Nucleic
Acids Res. 21:5600-5603 (1993), the disclosures of both
being incorporated herein by reference.

Solid phase synthesis of biological polymers has been
evolving since the early "Merrifield" solid phase peptide
synthesis, described in Merrifield, J. Am. Chem. Soc.
85:2149-2154 (1963), incorporated herein by reference for
all purposes. Solid-phase synthesis techniques have been
provided for the synthesis of several peptide sequences on,
for example, a number of "pins." See e.g., Geysen et al., J.
Immun. Meth. 102:259-274 (1987), incorporated herein by
reference for all purposes. Other solid-phase techniques
involve, for example, synthesis of various peptide sequences
on different cellulose disks supported in a column. See Frank
and Doring, Tetrahedron 44:6031-6040 (1988),
incorporated herein by reference for all purposes. Still other
solid-phase techniques are described in U.S. Pat. No. 4,728,502
issued to Hamill and WO 90/00626 (Beattie, inventor).

Each of the above techniques produces only a relatively
low density array of polymers. For example, the technique
described in Geysen et al. is limited to producing 96
different polymers on pins spaced in the dimensions of a
standard microtiter plate.

Improved methods of forming large arrays of
oligonucleotides, peptides and other polymer sequences in a short
period of time have been devised. Of particular note, Pirrung
et al., U.S. Pat. No. 5,143,854 (see also PCT Application No.
WO 90/15070) and Fodor et al., PCT Publication No. WO
92/10092, all incorporated herein by reference, disclose
methods of forming vast arrays of peptides, oligonucleotides
and other polymer sequences using, for example,
light-directed synthesis techniques. See also, Fodor et al., Science,
251:767-777 (1991), also incorporated herein by reference
for all purposes. These procedures are now referred to as
VLSIPS™ procedures.



In the above-referenced Fodor et al., PCT application, an
elegant method is described for using a computer-controlled
system to direct a VLSIPS™ procedure. Using this
approach, one heterogenous array of polymers is converted,
through simultaneous coupling at a number of reaction sites,
into a different heterogenous array. See, U.S. Pat. No.
5,384,261 and U.S. application Ser. No. 07/980,523, the
disclosures of which are incorporated herein for all
purposes.

The development of VLSIPS™ technology as described
in the above-noted U.S. Pat. No. 5,143,854 and PCT patent
publication Nos. WO 90/15070 and 92/10092, is considered
pioneering technology in the fields of combinatorial
synthesis and screening of combinatorial libraries. More recently,
patent application Ser. No. 08/082,937, filed Jun. 25, 1993,
now abandoned, describes methods for making arrays of
oligonucleotide probes that can be used to check or
determine a partial or complete sequence of a target nucleic acid
and to detect the presence of a nucleic acid containing a
specific oligonucleotide sequence.

A number of biochemical processes of pharmaceutical
interest involve the interaction of some species, e.g., a drug,
a peptide or protein, or RNA, with double-stranded DNA.
For example, protein/DNA binding interactions are involved
with a number of transcription factors as well as tumor
suppression associated with the p53 protein and the genes
contributing to a number of cancer conditions.



SUMMARY OF THE INVENTION 

High-density arrays of diverse unimolecular,
double-stranded oligonucleotides, as well as arrays of
conformationally restricted probes and methods for their use are
provided by virtue of the present invention. In addition,
methods and devices for detecting duplex formation of
oligonucleotides on an array of diverse single-stranded
oligonucleotides are also provided by this invention.
Further, an adhesive based on the specific binding
characteristics of two arrays of complementary oligonucleotides is
provided in the present invention.

According to one aspect of the present invention, libraries
of unimolecular, double-stranded oligonucleotides are
provided. Each member of the library is comprised of a solid
support, an optional spacer for attaching the double-stranded
oligonucleotide to the support and for providing sufficient
space between the double-stranded oligonucleotide and the
solid support for subsequent binding studies and assays, an
oligonucleotide attached to the spacer and further attached to
a second complementary oligonucleotide by means of a
flexible linker, such that the two oligonucleotide portions
exist in a double-stranded configuration. More particularly,
the members of the libraries of the present invention can be
represented by the formula:

Y-L¹-X¹-L²-X²

in which Y is a solid support, L¹ is a bond or a spacer, L² is
a flexible linking group, and X¹ and X² are a pair of
complementary oligonucleotides.
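As an illustration only, a library member with a solid support Y, an optional spacer L¹, an oligonucleotide X¹, a flexible linker L² and a complementary oligonucleotide X² can be modeled as a small record. The sketch below assumes that sequences are written 5' to 3' and that X² is the exact reverse complement of X¹; the class name, function names, and the default labels "glass" and "PEG" are hypothetical placeholders, not terms defined by the specification.

```python
from dataclasses import dataclass

# DNA base-pairing table used to derive X2 from X1.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


@dataclass(frozen=True)
class LibraryMember:
    support: str  # Y: solid support
    spacer: str   # L1: empty string stands for a direct bond
    x1: str       # X1: first oligonucleotide
    linker: str   # L2: flexible linking group
    x2: str       # X2: oligonucleotide complementary to X1

    def formula(self) -> str:
        parts = [self.support, self.spacer, self.x1, self.linker, self.x2]
        return "-".join(p for p in parts if p)


def make_member(x1: str, support: str = "glass", spacer: str = "",
                linker: str = "PEG") -> LibraryMember:
    """Build a member whose X2 is the reverse complement of X1."""
    return LibraryMember(support, spacer, x1.upper(), linker,
                         reverse_complement(x1))


member = make_member("ATGC")
print(member.formula())
```

Because X² is derived from X¹, varying X¹ alone is enough to enumerate a library of such hairpin-like members in this model.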

In a specific aspect of the invention, the library of
different unimolecular, double-stranded oligonucleotides
can be used for screening a sample for a species which binds
to one or more members of the library.

In a related aspect of the invention, a library of different
conformationally-restricted probes attached to a solid
support is provided. The individual members each have the
formula:



in which X¹¹ and X¹² are complementary oligonucleotides
and Z is a probe having sufficient length such that X¹¹ and
X¹² form a double-stranded oligonucleotide portion of the
member and thereby restrict the conformations available to
the probe. In a specific aspect of the invention, the library of
different conformationally-restricted probes can be used for
screening a sample for a species which binds to one or more
probes in the library.

According to yet another aspect of the present invention,
methods and devices for the bioelectronic detection of
duplex formation are provided.

According to still another aspect of the invention, an
adhesive is provided which comprises two surfaces of
complementary oligonucleotides.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A to 1F illustrate the preparation of a member of
a library of surface-bound, unimolecular double-stranded
DNA as well as binding studies with receptors having
specificity for either the double stranded DNA portion, a
probe which is held in a conformationally restricted form by
DNA scaffolding, or a bulge or loop region of RNA.

DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

Abbreviations 

The following abbreviations are used herein: phi,
phenanthrenequinone diimine; phen', 5-amido-glutaric acid-1,10-
phenanthroline; dppz, dipyridophenazine.
Glossary

The following terms are intended to have the following
general meanings as they are used herein:

Chemical terms: As used herein, the term "alkyl" refers to
a saturated hydrocarbon radical which may be straight-chain
or branched-chain (for example, ethyl, isopropyl, t-amyl, or
2,5-dimethylhexyl). When "alkyl" or "alkylene" is used to
refer to a linking group or a spacer, it is taken to be a group
having two available valences for covalent attachment, for
example, -CH₂CH₂-, -CH₂CH₂CH₂-,
-CH₂CH₂CH(CH₃)CH₂- and -CH₂(CH₂CH₂)₂CH₂-.
Preferred alkyl groups as substituents are those containing 1
to 10 carbon atoms, with those containing 1 to 6 carbon
atoms being particularly preferred. Preferred alkyl or
alkylene groups as linking groups are those containing 1 to 20
carbon atoms, with those containing 3 to 6 carbon atoms
being particularly preferred. The term "polyethylene glycol"
is used to refer to those molecules which have repeating
units of ethylene glycol, for example, hexaethylene glycol
(HO-(CH₂CH₂O)₅-CH₂CH₂OH). When the term
"polyethylene glycol" is used to refer to linking groups and spacer
groups, it would be understood by one of skill in the art that
other polyethers or polyols could be used as well (i.e.,
polypropylene glycol or mixtures of ethylene and propylene
glycols).

The term "protecting group" as used herein, refers to any
of the groups which are designed to block one reactive site
in a molecule while a chemical reaction is carried out at
another reactive site. More particularly, the protecting
groups used herein can be any of those groups described in
Greene, et al., Protective Groups In Organic Chemistry, 2nd
Ed., John Wiley & Sons, New York, N.Y., 1991, incorporated
herein by reference. The proper selection of protecting
groups for a particular synthesis will be governed by the
overall methods employed in the synthesis. For example, in
"light-directed" synthesis, discussed below, the protecting
groups will be photolabile protecting groups such as NVOC,
MeNPOC, and those disclosed in co-pending Application
PCT/US93/10162 (filed Oct. 22, 1993), incorporated herein
by reference. In other methods, protecting groups may be
removed by chemical methods and include groups such as
FMOC, DMT and others known to those of skill in the art.

Complementary or substantially complementary: Refers
to the hybridization or base pairing between nucleotides or
nucleic acids, such as, for instance, between the two strands
of a double stranded DNA molecule or between an
oligonucleotide primer and a primer binding site on a single
stranded nucleic acid to be sequenced or amplified.
Complementary nucleotides are, generally, A and T (or A and U), or
C and G. Two single stranded RNA or DNA molecules are
said to be substantially complementary when the nucleotides
of one strand, optimally aligned and compared and with
appropriate nucleotide insertions or deletions, pair with at
least about 80% of the nucleotides of the other strand,
usually at least about 90% to 95%, and more preferably from
about 98 to 100%.
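The percentage criterion above can be sketched in a few lines. This is a simplified illustration, not the method of the specification: it assumes two strands of equal length written 5' to 3', compares them in antiparallel orientation, and skips the optimal alignment with insertions or deletions that the definition allows. The function names and the 80% default threshold parameter are illustrative choices.

```python
# Base pairs named in the definition: A with T (or A with U), and C with G.
PAIRS = {("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"),
         ("C", "G"), ("G", "C")}


def percent_complementary(strand: str, other: str) -> float:
    """Percentage of positions in `strand` that pair with `other`.

    Both strands are written 5'->3'; `other` is read in reverse so that
    the comparison is antiparallel. No gaps are considered.
    """
    if not strand or len(strand) != len(other):
        raise ValueError("strands must be non-empty and of equal length")
    paired = sum((a, b) in PAIRS
                 for a, b in zip(strand.upper(), reversed(other.upper())))
    return 100.0 * paired / len(strand)


def is_substantially_complementary(strand: str, other: str,
                                   threshold: float = 80.0) -> bool:
    """True when the pairing percentage meets the chosen threshold."""
    return percent_complementary(strand, other) >= threshold


print(percent_complementary("ATGC", "GCAT"))
print(is_substantially_complementary("AAAAAAAAAA", "TTTTTTTTCC"))
```

A real comparison would first align the strands (allowing gaps) before counting pairs; the fixed-length comparison here only shows how the 80%, 90 to 95% and 98 to 100% figures would be applied once an alignment is chosen.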

Alternatively, substantial complementarity exists when an
RNA or DNA strand will hybridize under selective
hybridization conditions to its complement. Typically, selective
hybridization will occur when there is at least about 65%
complementarity over a stretch of at least 14 to 25
nucleotides, preferably at least about 75%, more preferably at
least about 90% complementarity. See, M. Kanehisa Nucleic
Acids Res. 12:203 (1984), incorporated herein by reference.

Stringent hybridization conditions will typically include
salt concentrations of less than about 1M, more usually less
than about 500 mM and preferably less than about 200 mM.
Hybridization temperatures can be as low as 5° C., but are
typically greater than 22° C., more typically greater than
about 30° C. and preferably in excess of about 37° C.
Longer fragments may require higher hybridization
temperatures for specific hybridization. As other factors may
affect the stringency of hybridization, including base
composition and length of the complementary strands, presence
of organic solvents and extent of base mismatching, the
combination of parameters is more important than the
absolute measure of any one alone.

Epitope: The portion of an antigen molecule which is 
delineated by the area of interaction with the subclass of 
receptors known as antibodies. 

Identifier tag: A means whereby one can identify which
molecules have experienced a particular reaction in the
synthesis of an oligomer. The identifier tag also records the
step in the synthesis series in which the molecules
experienced that particular monomer reaction. The identifier tag
may be any recognizable feature which is, for example:
microscopically distinguishable in shape, size, color, optical
density, etc.; differently absorbing or emitting of light;
chemically reactive; magnetically or electronically encoded;
or in some other way distinctively marked with the required
information. A preferred example of such an identifier tag is
an oligonucleotide sequence.

Ligand/Probe: A ligand is a molecule that is recognized by
a particular receptor. The agent bound by or reacting with a
receptor is called a "ligand," a term which is definitionally
meaningful only in terms of its counterpart receptor. The
term "ligand" does not imply any particular molecular size
or other structural or compositional feature other than that
the substance in question is capable of binding or otherwise
interacting with the receptor. Also, a ligand may serve either
as the natural ligand to which the receptor binds, or as a
functional analogue that may act as an agonist or antagonist.
Examples of ligands that can be investigated by this
invention include, but are not restricted to, agonists and
antagonists for cell membrane receptors, toxins and venoms, viral
epitopes, hormones (e.g., opiates, steroids, etc.), hormone
receptors, peptides, enzymes, enzyme substrates, substrate
analogs, transition state analogs, cofactors, drugs, proteins,
and antibodies. The term "probe" refers to those molecules
which are expected to act like ligands but for which binding
information is typically unknown. For example, if a receptor
is known to bind a ligand which is a peptide β-turn, a
"probe" or library of probes will be those molecules
designed to mimic the peptide β-turn. In instances where the
particular ligand associated with a given receptor is
unknown, the term probe refers to those molecules designed
as potential ligands for the receptor.

Monomer: Any member of the set of molecules which can
be joined together to form an oligomer or polymer. The set
of monomers useful in the present invention includes, but is
not restricted to, for the example of oligonucleotide synthe-
Chem. Soc. 103:3185 (1981), both incorporated herein by
reference, or by other chemical methods using either a
commercial automated oligonucleotide synthesizer or
VLSIPS™ technology (discussed in detail below). When
oligonucleotides are referred to as "double-stranded," it is
understood by those of skill in the art that a pair of
oligonucleotides exist in a hydrogen-bonded, helical array
typically associated with, for example, DNA. In addition to
the 100% complementary form of double-stranded
oligonucleotides, the term "double-stranded" as used herein is
also meant to refer to those forms which include such
structural features as bulges and loops, described more fully
in such biochemistry texts as Stryer, Biochemistry, Third
Ed., (1988), previously incorporated herein by reference for
all purposes.

Receptor: A molecule that has an affinity for a given
ligand or probe. Receptors may be naturally-occurring or
man made molecules. Also, they can be employed in their
unaltered natural or isolated state or as aggregates with other
be employed by this invention include, but are not restricted
to, antibodies, cell membrane receptors, monoclonal
antibodies and antisera reactive with specific antigenic
determinants (such as on viruses, cells or other materials), drugs,
polynucleotides, nucleic acids, peptides, cofactors, lectins,
sugars, polysaccharides, cells, cellular membranes, and
organelles. Receptors are sometimes referred to in the art as
anti-ligands. As the term receptors is used herein, no
difference in meaning is intended. A "ligand-receptor pair" is
formed when two molecules have combined through
molecular recognition to form a complex. Other examples of
receptors which can be investigated by this invention
include but are not restricted to:

a) Microorganism receptors: Determination of ligands or
probes that bind to receptors, such as specific transport
proteins or enzymes essential to survival of
microorganisms, is useful in a new class of antibiotics. Of
particular value would be antibiotics against
opportunistic fungi, protozoa, and those bacteria resistant to the
antibiotics in current use.

b) Enzymes: For instance, the binding site of enzymes
such as the enzymes responsible for cleaving
neurotransmitters. Determination of ligands or probes that
bind to certain receptors, and thus modulate the action
of the enzymes that cleave the different
neurotransmitters, is useful in the development of drugs that can be
used in the treatment of disorders of neurotransmission.

c) Antibodies: For instance, the invention may be useful
in investigating the ligand-binding site on the antibody
molecule which combines with the epitope of an
antigen of interest. Determining a sequence that mimics an
antigenic epitope may lead to the development of
vaccines of which the immunogen is based on one or
more of such sequences, or lead to the development of
related diagnostic agents or compounds useful in
therapeutic treatments such as for autoimmune diseases
(e.g., by blocking the binding of the "self" antibodies).

d) Nucleic Acids: The invention may be useful in inves- 
tigating sequences of nucleic acids acting as binding 
sites for cellular proteins ("trans-acting factors"). Such 
sequences may include, e.g., transcription factors, sup- 
pressors, enhancers or promoter sequences. 

e) Catalytic Polypeptides: Polymers, preferably
polypeptides, which are capable of promoting a chemical
tively) and synthetic analogs thereof. As used herein,
monomers refers to any member of a basis set for synthesis of an
oligomer. Different basis sets of monomers may be used at
successive steps in the synthesis of a polymer.

Oligomer or Polymer: The oligomer or polymer
sequences of the present invention are formed from the
chemical or enzymatic addition of monomer subunits. Such
oligomers include, for example, both linear, cyclic, and
branched polymers of nucleic acids, polysaccharides,
phospholipids, and peptides having either α-, β-, or ω-amino
acids, heteropolymers in which a known drug is covalently
bound to any of the above, polyurethanes, polyesters,
polycarbonates, polyureas, polyamides, polyethyleneimines,
polyarylene sulfides, polysiloxanes, polyimides,
polyacetates, or other polymers which will be readily apparent to
one skilled in the art upon review of this disclosure. As used
herein, the term oligomer or polymer is meant to include
such molecules as β-turn mimetics, prostaglandins and
benzodiazepines which can also be synthesized in a stepwise
fashion on a solid support.

Peptide: A peptide is an oligomer in which the monomers
are amino acids joined together through amide bonds; it is
alternatively referred to as a polypeptide.
In the context of this specification it should be appreciated
that when α-amino acids are used, they may be the L-optical
isomer or the D-optical isomer. Other amino acids which are
useful in the present invention include unnatural amino acids
such as β-alanine, phenylglycine, homoarginine and the like.
Peptides are more than two amino acid monomers long, and
often more than 20 amino acid monomers long. Standard
abbreviations for amino acids are used (e.g., P for proline).
These abbreviations are included in Stryer, Biochemistry,
Third Ed. (1988), which is incorporated herein by reference
for all purposes.

Oligonucleotides: An oligonucleotide is a single-stranded
DNA or RNA molecule, typically prepared by synthetic
means. Alternatively, naturally occurring oligonucleotides,
or fragments thereof, may be isolated from their natural
sources or purchased from commercial sources. Those
oligonucleotides employed in the present invention will be 4 to
100 nucleotides in length, preferably from 6 to 30 nucleotides,
although oligonucleotides of different length may be
appropriate. Suitable oligonucleotides may be prepared by
the phosphoramidite method described by Beaucage and
Caruthers, Tetrahedron Lett. 22:1859-1862 (1981), or by
the triester method according to Matteucci, et al., J. Am
reaction involving the conversion of one or more
reactants to one or more products. Such polypeptides
generally include a binding site specific for at least one
reactant or reaction intermediate and an active functionality
proximate to the binding site, which functionality
is capable of chemically modifying the bound
reactant. Catalytic polypeptides are described in
Lerner, R. A. et al., Science 252:659 (1991), which is
incorporated herein by reference.
f) Hormone receptors: For instance, the receptors for
insulin and growth hormone. Determination of the
ligands which bind with high affinity to a receptor is
useful in the development of, for example, an oral
replacement of the daily injections which diabetics
must take to relieve the symptoms of diabetes, and in
the other case, a replacement for the scarce human
growth hormone that can only be obtained from cadavers
or by recombinant DNA technology. Other
examples are the vasoconstrictive hormone receptors;
determination of those ligands that bind to a receptor
may lead to the development of drugs to control blood
pressure.

g) Opiate receptors: Determination of ligands that bind to 
the opiate receptors in the brain is useful in the devel- 
opment of less-addictive replacements for morphine 
and related drugs. 

Substrate or Solid Support: A material having a rigid or 
semi-rigid surface. Such materials will preferably take the 
form of plates or slides, small beads, pellets, disks or other 
convenient forms, although other forms may be used. In 
some embodiments, at least one surface of the substrate will
be substantially flat. In other embodiments, a roughly spherical
shape is preferred.

Synthetic: Produced by in vitro chemical or enzymatic 
synthesis. The synthetic libraries of the present invention 
may be contrasted with those in viral or plasmid vectors, for 
instance, which may be propagated in bacterial, yeast, or 
other living hosts. 
The broad concept of the present invention is illustrated in
FIGS. 1A to 1F. FIGS. 1A, 1B and 1C illustrate the preparation
of surface-bound unimolecular double-stranded DNA,
while FIGS. 1D, 1E, and 1F illustrate uses for the libraries
of the present invention.

FIG. 1A shows a solid support 1 having an attached spacer
2, which is optional. Attached to the distal end of the spacer
is a first oligomer 3, which can be attached as a single unit
or synthesized on the support or spacer in a monomer by
monomer approach. FIG. 1B shows a subsequent stage in
the preparation of one member of a library according to the
present invention. In this stage, a flexible linker 4 is attached
to the distal end of the oligomer 3. In other embodiments, the
flexible linker will be a probe. FIG. 1C shows the completed
surface-bound unimolecular double-stranded DNA which is
one member of a library, wherein a second oligomer 5 is now
attached to the distal end of the flexible linker (or probe). As
shown in FIG. 1C, the length of the flexible linker (or probe)
4 is sufficient such that the first and second oligomers (which
are complementary) exist in a double-stranded conformation.
It will be appreciated by one of skill in the art that the
libraries of the present invention will contain multiple,
individually synthesized members which can be screened for
various types of activity. Three such binding events are
illustrated in FIGS. 1D, 1E and 1F.



In FIG. 1D, a receptor 6, which can be a protein, RNA
molecule or other molecule which is known to bind to DNA,
is introduced to the library. Determining which member of
a library binds to the receptor provides information which is
useful for diagnosing diseases, sequencing DNA or RNA,
identifying genetic characteristics, or in drug discovery.

In FIG. 1E the linker 4 is a probe for which binding
information is sought. The probe is held in a conformationally
restricted manner by the flanking oligomers 3 and 5,
which are present in a double-stranded conformation. As a
result, a library of conformationally restricted probes can be
screened for binding activity with a receptor 7 which has
specificity for the probe.

The present invention also contemplates the preparation
of libraries of unimolecular, double-stranded oligonucleotides
having bulges or loops in one of the strands as
depicted in FIG. 1F. In FIG. 1F, one oligonucleotide 5 is
shown as having a bulge 8. Specific RNA bulges are often
recognized by proteins (e.g., TAR RNA is recognized by the
TAT protein of HIV). Accordingly, libraries of RNA bulges
or loops are useful in a number of diagnostic applications.
One of skill in the art will appreciate that the bulge or loop
can be present in either oligonucleotide portion 3 or 5.
Libraries of Unimolecular, Double-Stranded Oligonucleotides
In one aspect, the present invention provides libraries of
unimolecular double-stranded oligonucleotides, each member
of the library having the formula:

Y-L¹-X¹-L²-X²

in which Y represents a solid support, X¹ and X² represent
a pair of complementary oligonucleotides, L¹ represents a
bond or a spacer, and L² represents a linking group having
sufficient length such that X¹ and X² form a double-stranded
oligonucleotide.
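
As a rough illustration of the formula above, the following Python sketch represents one library member as a record and checks how closely the second oligonucleotide matches the reverse complement of the first. The class name `LibraryMember`, the helper functions, the convention of writing both strands 5' to 3', and the 90% default threshold are assumptions made for this example only; none of them is taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Watson-Crick pairing for DNA bases (illustrative; RNA would use U for T).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def percent_complementarity(x1: str, x2: str) -> float:
    """Percentage of positions at which x2 pairs with x1 in an antiparallel duplex."""
    if len(x1) != len(x2):
        raise ValueError("this sketch only compares strands of equal length")
    target = reverse_complement(x1)
    matches = sum(a == b for a, b in zip(target, x2))
    return 100.0 * matches / len(x1)


@dataclass(frozen=True)
class LibraryMember:
    """One member: support (Y), spacer (L1), first oligo (X1), linker (L2), second oligo (X2)."""
    support: str
    spacer: Optional[str]  # None models L1 being a bond rather than a spacer
    x1: str
    linker: str
    x2: str

    def is_double_stranded(self, min_percent: float = 90.0) -> bool:
        # Hypothetical criterion: treat the member as a hairpin duplex when
        # the two strands reach the chosen complementarity threshold.
        return percent_complementarity(self.x1, self.x2) >= min_percent


member = LibraryMember("glass", "PEG spacer", "GATTACAGCT",
                       "hexaethyleneglycol", reverse_complement("GATTACAGCT"))
print(member.x2, member.is_double_stranded())
```

The record keeps the five parts of the formula separate so that a library could vary X¹ and X² while holding the support, spacer, and linker fixed.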

The solid support may be biological, nonbiological,
organic, inorganic, or a combination of any of these, existing
as particles, strands, precipitates, gels, sheets, tubing,
spheres, containers, capillaries, pads, slices, films, plates,
slides, etc. The solid support is preferably flat but may take
on alternative surface configurations. For example, the solid
support may contain raised or depressed regions on which
synthesis takes place. In some embodiments, the solid
support will be chosen to provide appropriate light-absorbing
characteristics. For example, the support may be a
polymerized Langmuir Blodgett film, functionalized glass,
Si, Ge, GaAs, GaP, SiO2, SiN4, modified silicon, or any one
of a variety of gels or polymers such as (poly)tetrafluoroethylene,
(poly)vinylidenedifluoride, polystyrene, polycarbonate,
or combinations thereof. Other suitable solid support
materials will be readily apparent to those of skill in the art.
Preferably, the surface of the solid support will contain
reactive groups, which could be carboxyl, amino, hydroxyl,
thiol, or the like. More preferably, the surface will be
optically transparent and will have surface Si-OH functionalities,
such as are found on silica surfaces.

Attached to the solid support is an optional spacer, L¹. The
spacer molecules are preferably of sufficient length to permit
the double-stranded oligonucleotides in the completed member
of the library to interact freely with molecules exposed
to the library. The spacer molecules, when present, are
typically 6-50 atoms long to provide sufficient exposure for
the attached double-stranded DNA molecule. The spacer, L¹,
is comprised of a surface attaching portion and a longer
chain portion. The surface attaching portion is that part of L¹
which is directly attached to the solid support. This portion
can be attached to the solid support via carbon-carbon bonds
using, for example, supports having (poly)trifluorochloroethylene
surfaces, or preferably, by siloxane bonds (using,
for example, glass or silicon oxide as the solid support).
Siloxane bonds with the surface of the support are formed in
one embodiment via reactions of surface attaching portions
bearing trichlorosilyl or trialkoxysilyl groups. The surface
attaching groups will also have a site for attachment of the

longer chain portion. For example, groups which are suitable
for attachment to a longer chain portion would include
amines, hydroxyl, thiol, and carboxyl. Preferred surface
attaching portions include aminoalkylsilanes and
hydroxyalkylsilanes. In particularly preferred embodiments, the
surface attaching portion of L¹ is either
bis(2-hydroxyethyl)-aminopropyltriethoxysilane,
2-hydroxyethylaminopropyltriethoxysilane
or hydroxypropyltriethoxysilane.

The longer chain portion can be any of a variety of
molecules which are inert to the subsequent conditions for
polymer synthesis. These longer chain portions will typically
be aryl acetylene, ethylene glycol oligomers containing
2-14 monomer units, diamines, diacids, amino acids, peptides,
or combinations thereof. In some embodiments, the
longer chain portion is a polynucleotide. The longer chain
portion which is to be used as part of L¹ can be selected
based upon its hydrophilic/hydrophobic properties to
[illegible] double-stranded oligonucleotides to certain
receptors, proteins or drugs. The longer [illegible]
nucleotides, alkylene, polyalcohol, polyester, [illegible]
and combinations thereof. [illegible] synthesis of the
libraries of the invention, L¹ will typically have a protecting
group, attached to a functional group (i.e., hydroxyl, amino
or carboxylic acid) on the distal or terminal end of the chain
portion (opposite the solid support). After deprotection and
coupling, [illegible]

[illegible] which is a single-stranded DNA or RNA molecule. The
oligonucleotides which are part of the present invention are
typically of from about 4 to about 100 nucleotides in length.
Preferably, X¹ is an oligonucleotide which is about 6 to
about 30 nucleotides in length. The oligonucleotide is typically
linked to L¹ via the 3'-hydroxyl group of the oligonucleotide
[illegible] formation of an ether, ester, carbamate or
phosphate ester linkage.

Attached to the distal end of X¹ is a linking group, L²,
which is flexible and of sufficient length that X¹ can effectively
hybridize with X². The length of the linker will
typically be a length which is at least the length spanned by
two nucleotide monomers, and preferably at least four
nucleotide monomers, while not being so long as to interfere
with either the pairing of X¹ and X² or any subsequent
assays. The linking group itself will typically be an alkylene
group (of from about 6 to about 24 carbons in length), a
polyethyleneglycol group (of from about 2 to about 24
ethyleneglycol monomers in a linear configuration), a
polyalcohol group, a polyamine group (e.g., spermine, spermidine
and polymeric derivatives thereof), a polyester group
(e.g., poly(ethyl acrylate) having of from 3 to 15 ethyl
acrylate monomers in a linear configuration), a
polyphosphodiester group, or a polynucleotide (having from about 2
to about 12 nucleic acids). Preferably, the linking group will
be a polyethyleneglycol group which is at least a
tetraethyleneglycol, and more preferably, from about 1 to 4
hexaethyleneglycols linked in a linear array. For use in synthesis
of the compounds of the invention, the linking group will be
provided with functional groups which can be suitably
protected or activated. The linking group will be covalently
attached to each of the complementary oligonucleotides, X¹
and X², by means of an ether, ester, carbamate, phosphate
ester or amine linkage. The flexible linking group L² will be
attached to the 5'-hydroxyl of the terminal monomer of X¹
and to the 3'-hydroxyl of the initial monomer of X². Preferred
linkages are phosphate ester linkages which can be
formed in the same manner as the oligonucleotide linkages
which are present in X¹ and X². For example,
hexaethyleneglycol can be protected on one terminus with a
photolabile protecting group (i.e., NVOC or MeNPOC) and
activated on the other terminus with
2-cyanoethyl-N,N-diisopropylaminochlorophosphite to form a
phosphoramidite. This linking group can then be used for
construction of the libraries in the same manner as the
photolabile-protected, phosphoramidite-activated nucleotides.
Alternatively, ester linkages to X¹ and X² can be formed when
the L² has terminal carboxylic acid moieties (using the
5'-hydroxyl of X¹ and the 3'-hydroxyl of X²). Other methods of
forming ether, carbamate or amine linkages are known to those of
skill in the art and particular reagents and references can be
found in such texts as March, Advanced Organic Chemistry,
4th Ed., Wiley-Interscience, New York, N.Y., 1992,
incorporated herein by reference.

The oligonucleotide, X², which is covalently attached to
the distal end of the linking group is, like X¹, a
single-stranded DNA or RNA molecule. The [illegible]
about 4 to about 100 nucleotides in length. [illegible]
oligonucleotide which is about 6 to about 30 nucleotides
in length and exhibits complementarity [illegible]
100%. More preferably, [illegible] complementary.
In one group of embodiments, either X¹ or X²
[illegible] comprise a bulge or loop portion and exhibit
complementarity of from 90 to 100% over the remainder of the
[illegible]

In a particularly preferred embodiment, the solid support
is a silica support, the spacer is a [illegible] conjugated
to an [illegible] polyethyleneglycol group, and X¹ and X²
are complementary oligonucleotides each comprising of from
6 to 30 nucleic acid monomers.

The library can have virtually any number of different
members, and will be limited only by [illegible]. In one
group of embodiments, the library will have [illegible]
100 members. In other groups of embodiments, the library
will have between 100 and 10000 members, and between
10000 and 1000000 members, preferably on a solid support.
In preferred embodiments, the library will have a density of
more than 100 members at known locations per cm²,
preferably more than 1000 per cm², more preferably more than
10000 per cm².

Libraries of Conformationally Restricted Probes

In still another aspect, the present invention provides
libraries of conformationally-restricted probes. Each of the
members of the library comprises a solid support having an
optional spacer which is attached to an oligomer of the
formula:

-X¹¹-Z-X¹²

in which X¹¹ and X¹² are complementary oligonucleotides
and Z is a probe [illegible] length such
that X¹¹ and X¹² [illegible] double-stranded DNA portion of
each member. X¹¹ and X¹² are as described above for X¹ and
X² respectively, except that for the present aspect of the
invention, each member of the probe library can have the
same X¹¹ and the same X¹², and differ only in the probe
portion. In one group of embodiments, X¹¹ and X¹² are
either a poly-A oligonucleotide or a poly-T oligonucleotide.

As noted above, each member of the library will typically
have a different probe portion. The probes, Z, can be any of
a variety of structures for which receptor-probe binding
information is sought for conformationally-restricted forms.
For example, the probe can be an agonist or antagonist for
a cell membrane receptor, a toxin, venom, viral epitope,
hormone, peptide, enzyme cofactor, drug, protein or
antibody. In one group of embodiments, the probes are different
peptides, each having of from about 4 to about 12 amino
acids. Preferably the probes will be linked via polyphosphate
diesters, although other linkages are also suitable. For
example, the last monomer employed on the X¹¹ chain can
be a 5'-aminopropyl-functionalized phosphoramidite nucleotide
(available from Glen Research, Sterling, Va., USA or
Genosys Biotechnologies, The Woodlands, Tex., USA)
which will provide a synthesis initiation site for the carboxy
to amino synthesis of the peptide probe. Once the peptide
probe is formed, a 3'-succinylated nucleoside (from
Cruachem, Sterling, Va., USA) will be added under peptide
coupling conditions. In yet another group of embodiments,
the probes will be oligonucleotides of from 4 to about 30
nucleic acid monomers which will form a DNA or RNA
hairpin structure. For use in synthesis, the probes can also
have associated functional groups (i.e., hydroxyl, amino,
carboxylic acid, anhydride and derivatives thereof) for
attaching two positions on the probe to each of the
complementary oligonucleotides.

The surface of the solid support is preferably provided
with a spacer molecule, although it will be understood that
the spacer molecules are not elements of this aspect of the
invention. Where present, the spacer molecules will be as
described above for L¹.

The libraries of conformationally restricted probes can
also have virtually any number of members. As above, the
number of members will be limited only by design of the
particular screening assay for which the library will be used,
and by the synthetic capabilities of the practitioner. In one
group of embodiments, the library will have from 2 to 100
members. In other groups of embodiments, the library will
have between 100 and 10000 members, and between 10000
and 1000000 members. Also as above, in preferred
embodiments, the library will have a density of more than 100
members at known locations per cm², preferably more than
1000 per cm², more preferably more than 10,000 per cm².
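
To give a feel for what these densities imply, the short Python calculation below converts a density in members per cm² into the centre-to-centre spacing of the synthesis regions. It assumes, purely for illustration, that the members sit on a gap-free square grid; the function name `feature_pitch_um` is hypothetical and the specification does not prescribe any particular layout.

```python
import math


def feature_pitch_um(members_per_cm2: float) -> float:
    """Centre-to-centre spacing, in micrometres, of members on a square grid.

    Assumes each member occupies an equal square cell with no gaps, so a
    density d (members per cm^2) gives a cell side of 1/sqrt(d) cm.
    """
    if members_per_cm2 <= 0:
        raise ValueError("density must be positive")
    # 1 cm = 10,000 micrometres
    return 1e4 / math.sqrt(members_per_cm2)


for density in (100, 1000, 10000):
    print(f"{density:>6} members/cm^2 -> pitch of about "
          f"{feature_pitch_um(density):.0f} um")
```

Under this assumption, each tenfold increase in density shrinks the spacing by a factor of about 3.2, which is why the higher densities call for the finer patterning methods described in the following section.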
Preparation of the Libraries 

The present invention further provides methods for the
preparation of diverse unimolecular, double-stranded
oligonucleotides on a solid support. In one group of
embodiments, the surface of a solid support has a plurality of
preselected regions. An oligonucleotide of from 6 to 30
monomers is formed on each of the preselected regions. A
linking group is then attached to the distal end of each of the
oligonucleotides. Finally, a second oligonucleotide is
formed on the distal end of each linking group such that the
second oligonucleotide is complementary to the oligonucleotide
already present in the same preselected region. The
linking group used will have sufficient length such that the
complementary oligonucleotides form a unimolecular,
double-stranded oligonucleotide. In another group of
embodiments, each chemically distinct member of the
library will be synthesized on a separate solid support.



Libraries on a Single Substrate
Light-Directed Methods

For those embodiments using a single solid support, the
oligonucleotides of the present invention can be formed
using a variety of techniques known to those skilled in the
art of polymer synthesis on solid supports. For example,
"light directed" methods (which are one technique in a
family of methods known as VLSIPS™ methods) are
described in U.S. Pat. No. 5,143,854, previously incorporated
by reference. The light directed methods discussed in
the '854 patent involve activating predefined regions of a
substrate or solid support and then contacting the substrate
with a preselected monomer solution. The predefined
regions can be activated with a light source, typically shone
through a mask (much in the manner of photolithography
techniques used in integrated circuit fabrication). Other
regions of the substrate remain inactive because they are
blocked by the mask from illumination and remain chemically
protected. Thus, a light pattern defines which regions
of the substrate react with a given monomer. By repeatedly
activating different sets of predefined regions and contacting
different monomer solutions with the substrate, a diverse
array of polymers is produced on the substrate. Of course,
other steps such as washing unreacted monomer solution
from the substrate can be used as necessary. Other
techniques include mechanical techniques such as those
described in PCT No. 92/10183, U.S. Pat. No. 5,384,261,
also incorporated herein by reference for all purposes. Still
further techniques include bead based techniques such as
those described in PCT US/93/04145, also incorporated
herein by reference, and pin based methods such as those
described in U.S. Pat. No. 5,288,514, also incorporated
herein by reference.
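
The mask-and-couple cycle described above can be sketched as a small Python simulation. This is only a schematic model under stated assumptions: each region is a string that grows by one letter per coupling, a "mask" is simply the set of region indices that are illuminated, and capping, oxidation, and washing are ignored. The function name `light_directed_synthesis` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def light_directed_synthesis(
    n_regions: int, cycles: Iterable[Tuple[Set[int], str]]
) -> List[str]:
    """Simulate repeated illuminate-then-couple cycles on a single substrate.

    Each cycle is (mask, monomer): the mask lists the regions exposed to
    light (and so deprotected); only those regions couple the monomer.
    """
    regions = [""] * n_regions
    for mask, monomer in cycles:
        for index in mask:
            if not 0 <= index < n_regions:
                raise IndexError(f"region {index} is not on the substrate")
            # Regions outside the mask stay protected and are left unchanged.
            regions[index] += monomer
    return regions


# Four cycles on four regions produce four different dimers.
cycles = [({0, 1}, "A"), ({2, 3}, "C"), ({0, 2}, "G"), ({1, 3}, "T")]
print(light_directed_synthesis(4, cycles))
```

In this toy run, four coupling steps yield every combination of two first-position and two second-position monomers, which illustrates how a modest number of masks can generate a diverse array.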

The VLSIPS™ methods are preferred for making the
compounds and libraries of the present invention. The
surface of a solid support, optionally modified with spacers
having photolabile protecting groups such as NVOC and
MeNPOC, is illuminated through a photolithographic mask,
yielding reactive groups (typically hydroxyl groups) in the
illuminated regions. A 3'-O-phosphoramidite activated
deoxynucleoside (protected at the 5'-hydroxyl with a
photolabile protecting group) is then presented to the surface
and chemical coupling occurs at sites that were exposed to
light. Following capping and oxidation, the substrate is
rinsed and the surface illuminated through a second mask, to
expose additional hydroxyl groups for coupling. A second
5'-protected, 3'-O-phosphoramidite activated deoxynucleoside
is presented to the surface. The selective photodeprotection
and coupling cycles are repeated until the desired set
of oligonucleotides is produced. Alternatively, an oligomer
of from, for example, 4 to 30 nucleotides can be added to
each of the preselected regions rather than synthesizing each
member in a monomer by monomer approach. At this point
in the synthesis, either a flexible linking group or a probe can
be attached in a similar manner. For example, a flexible
linking group such as polyethylene glycol will typically
have an activating group (i.e., a phosphoramidite) on one
end and a photolabile protecting group attached to the other
end. Suitably derivatized polyethylene glycol linking groups
can be prepared by the methods described in Durand, et al.,
Nucleic Acids Res. 18:6353-6359 (1990). Briefly, a
polyethylene glycol (i.e., hexaethylene glycol) can be
mono-protected using MeNPOC-chloride. Following purification
of the mono-protected glycol, the remaining hydroxy moiety
can be activated with
2-cyanoethyl-N,N-diisopropylaminochlorophosphite. Once the
flexible linking group has been
attached to the first oligonucleotide (X¹), deprotection and
coupling cycles will proceed using 5'-protected,
3'-O-phosphoramidite activated deoxynucleosides or intact oligomers.
Probes can be attached in a manner similar to that used for
the flexible linking group. When the desired probe is itself
an oligomer, it can be formed either in stepwise fashion on
the immobilized oligonucleotide or it can be separately
synthesized and coupled to the immobilized oligomer in a
single step. For example, preparation of conformationally
restricted β-turn mimetics will typically involve synthesis of
an oligonucleotide as described above, in which the last
nucleoside monomer will be derivatized with an
aminoalkyl-functionalized phosphoramidite. See, U.S. Pat. No.
5,288,514, previously incorporated by reference. The desired
peptide probe is typically formed in the direction from
carboxyl to amine terminus. Subsequent coupling of a
3'-succinylated nucleoside, for example, provides the first
monomer in the construction of the complementary
oligonucleotide strand (which is carried out by the above
methods). Alternatively, a library of probes can be prepared by
first derivatizing a solid support with multiple poly(A) or
poly(T) oligonucleotides which are suitably protected with
photolabile protecting groups, deprotecting at known sites
and constructing the probe at those sites, then coupling the
complementary poly(T) or poly(A) oligonucleotide.

Flow Channel or Spotting Methods

Additional methods applicable to library synthesis on a
single substrate are described in co-pending applications
Ser. No. 07/980,523, filed Nov. 20, 1992, and U.S. Pat. No.
5,384,261, incorporated herein by reference for all purposes.
In the methods disclosed in these applications, reagents are
delivered to the substrate by either (1) flowing within a
channel defined on predefined regions or (2) "spotting" on
predefined regions. However, other approaches, as well as
combinations of spotting and flowing, may be employed. In
each instance, certain activated regions of the substrate are
mechanically separated from other regions when the monomer
solutions are delivered to the various reaction sites.

A typical "flow channel" method applied to the compounds
and libraries of the present invention can generally
be described as follows. Diverse polymer sequences are
synthesized at selected regions of a substrate or solid support
by forming flow channels on a surface of the substrate
through which appropriate reagents flow or in which
appropriate reagents are placed. For example, assume a monomer
"A" is to be bound to the substrate in a first group of selected
regions. If necessary, all or part of the surface of the
substrate in all or a part of the selected regions is activated
for binding by, for example, flowing appropriate reagents
through all or some of the channels, or by washing the entire
substrate with appropriate reagents. After placement of a
channel block on the surface of the substrate, a reagent
having the monomer A flows through or is placed in all or
some of the channel(s). The channels provide fluid contact
to the first selected regions, thereby binding the monomer A
on the substrate directly or indirectly (via a spacer) in the
first selected regions.

Thereafter, a monomer B is coupled to second selected
regions, some of which may be included among the first
selected regions. The second selected regions will be in fluid
contact with second flow channel(s) through translation,
rotation, or replacement of the channel block on the surface
of the substrate; through opening or closing a selected valve;
or through deposition of a layer of chemical or photoresist.
If necessary, a step is performed for activating at least the
second regions. Thereafter, the monomer B is flowed
through or placed in the second flow channel(s), binding
monomer B at the second selected locations. In this particular
example, the resulting sequences bound to the substrate
at this stage of processing will be, for example, A, B, and
AB. The process is repeated to form a vast array of
sequences of desired length at known locations on the
substrate.

After the substrate is activated, monomer A can be flowed 
through some of the channels, monomer B can be flowed 
through other channels, a monomer C can be flowed through 
still other channels, etc. In this manner, many or all of the 
reaction regions are reacted with a monomer before the
channel block must be moved or the substrate must be 
washed and/or reactivated. By making use of many or all of 
the available reaction regions simultaneously, the number of 
washing and activation steps can be minimized. 
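
A minimal Python sketch of this idea follows. It models the substrate as a grid and assumes, for illustration only, that the channel block first lays parallel channels along the columns and is then rotated by 90 degrees so that the channels run along the rows, with a different monomer flowing in each channel at the same time. The function name `flow_channel_dimers` and the grid model are hypothetical simplifications, not the procedure of the cited applications.

```python
from typing import List, Sequence


def flow_channel_dimers(
    column_monomers: Sequence[str], row_monomers: Sequence[str]
) -> List[List[str]]:
    """Build dimers on a grid using two simultaneous multi-channel flow steps."""
    n_rows, n_cols = len(row_monomers), len(column_monomers)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

    # Step 1: one channel per column; every channel carries its own monomer,
    # so all regions react in a single coupling step.
    for col, monomer in enumerate(column_monomers):
        for row in range(n_rows):
            grid[row][col] += monomer

    # Step 2: the channel block is rotated so channels now follow the rows.
    for row, monomer in enumerate(row_monomers):
        for col in range(n_cols):
            grid[row][col] += monomer

    return grid


for line in flow_channel_dimers(["A", "B"], ["A", "B", "C"]):
    print(line)
```

Here two coupling steps produce six distinct dimers, because every reaction region is used in each step; delivering the same monomers one channel at a time would need five separate flow steps.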

One of skill in the art will recognize that there are
alternative methods of forming channels or otherwise
protecting a portion of the surface of the substrate. For example,
according to some embodiments, a protective coating such
as a hydrophilic or hydrophobic coating (depending upon
the nature of the solvent) is utilized over portions of the
substrate to be protected, sometimes in combination with
materials that facilitate wetting by the reactant solution in
other regions. In this manner, the flowing solutions are
further prevented from passing outside of their designated
flow paths.

The "spotting" methods of preparing compounds and
libraries of the present invention can be implemented in
much the same manner as the flow channel methods. For
example, a monomer A can be delivered to and coupled with
a first group of reaction regions which have been appropriately
activated. Thereafter, a monomer B can be delivered to
and reacted with a second group of activated reaction
regions. Unlike the flow channel embodiments described
above, reactants are delivered by directly depositing (rather
than flowing) relatively small quantities of them in selected
regions. In some steps, of course, the entire substrate surface
can be sprayed or otherwise coated with a solution. In
preferred embodiments, a dispenser moves from region to
region, depositing only as much monomer as necessary at
each stop. Typical dispensers include a micropipette to
deliver the monomer solution to the substrate and a robotic
system to control the position of the micropipette with
respect to the substrate, or an ink-jet printer. In other
embodiments, the dispenser includes a series of tubes, a
manifold, an array of pipettes, or the like so that various
reagents can be delivered to the reaction regions
simultaneously.

Pin-Based Methods 

Another method which is useful for the preparation of
compounds and libraries of the present invention involves
"pin based synthesis." This method is described in detail in
U.S. Pat. No. 5,288,514, previously incorporated herein by
reference. The method utilizes a substrate having a plurality
of pins or other extensions. The pins are each inserted
simultaneously into individual reagent containers in a tray.
In a common embodiment, an array of 96 pins/containers is
utilized.

Each tray is filled with a particular reagent for coupling in
a particular chemical reaction on an individual pin.
Accordingly, the trays will often contain different reagents. Since
the chemistry disclosed herein has been established such that
a relatively similar set of reaction conditions may be utilized
to perform each of the reactions, it becomes possible to
conduct multiple chemical coupling steps simultaneously. In
the first step of the process the invention provides for the use
of substrate(s) on which the chemical coupling steps are
conducted. The substrate is optionally provided with a
spacer having active sites. In the particular case of
oligonucleotides, for example, the spacer may be selected from a
wide variety of molecules which can be used in organic
environments associated with synthesis as well as aqueous
environments associated with binding studies. Examples of
suitable spacers are polyethyleneglycols, dicarboxylic acids,
polyamines and alkylenes, substituted with, for example,
methoxy and ethoxy groups. Additionally, the spacers will
have an active site on the distal end. The active sites are
optionally protected initially by protecting groups. Among a
wide variety of protecting groups which are useful are
FMOC, BOC, t-butyl esters, t-butyl ethers, and the like.
Various exemplary protecting groups are described in, for
example, Atherton et al., Solid Phase Peptide Synthesis, IRL
Press (1989), incorporated herein by reference. In some
embodiments, the spacer may provide for a cleavable
function by way of, for example, exposure to acid or base.
Libraries on Multiple Substrates 
Bead Based Methods 

Yet another method which is useful for synthesis of
compounds and libraries of the present invention involves
"bead based synthesis." A general approach for bead based
synthesis is described in copending application Ser. Nos.
07/762,522 (filed Sep. 18, 1991, now abandoned); 07/946,239
(filed Sep. 16, 1992); 08/146,886 (filed Nov. 2, 1993);
07/876,792 (filed Apr. 29, 1992) and PCT/US93/04145
(filed Apr. 28, 1993), the disclosures of which are
incorporated herein by reference.

For the synthesis of molecules such as oligonucleotides 
on beads, a large plurality of beads are suspended in a
suitable carrier (such as water) in a container. The beads are
provided with optional spacer molecules having an active 
site. The active site is protected by an optional protecting 
group. 

In a first step of the synthesis, the beads are divided for
coupling into a plurality of containers. For the purposes of
this brief description, the number of containers will be
limited to three, and the monomers denoted as A, B, C, D,
E, and F. The protecting groups are then removed and a first
portion of the molecule to be synthesized is added to each of
the three containers (i.e., A is added to container 1, B is
added to container 2 and C is added to container 3).

Thereafter, the various beads are appropriately washed of
excess reagents, and remixed in one container. Again, it will
be recognized that by virtue of the large number of beads
utilized at the outset, there will similarly be a large number 
of beads randomly dispersed in the container, each having a 
particular first portion of the monomer to be synthesized on 
a surface thereof. 

Thereafter, the various beads are again divided for
coupling in another group of three containers. The beads in the
first container are deprotected and exposed to a second
monomer (D), while the beads in the second and third
containers are coupled to molecule portions E and F,
respectively. Accordingly, molecules AD, BD, and CD will be
present in the first container, while AE, BE, and CE will be
present in the second container, and molecules AF, BF, and
CF will be present in the third container. Each bead,
however, will have only a single type of molecule on its surface.
Thus, all of the possible molecules formed from the first
portions A, B, C, and the second portions D, E, and F have
been formed.
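
By way of illustration only, the divide, couple, and recombine cycle described above can be sketched as a short Python simulation. The function name `split_and_pool`, the bead count of 900, and the fixed random seed are assumptions introduced for this sketch and form no part of the described method; each bead is modeled simply as a string to which one monomer letter is appended per coupling step.

```python
import random
from collections import Counter


def split_and_pool(n_beads, rounds, seed=0):
    """Simulate split-and-pool synthesis on n_beads beads.

    rounds is a list of monomer lists, one list per coupling step,
    with one container per monomer in that step.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    beads = [""] * n_beads     # each bead starts with nothing attached
    for monomers in rounds:
        rng.shuffle(beads)     # remix the pooled beads
        n = len(monomers)
        # Divide the pool into one container per monomer.
        containers = [beads[i::n] for i in range(n)]
        # Couple one monomer to every bead in its container, then recombine.
        beads = [bead + monomer
                 for container, monomer in zip(containers, monomers)
                 for bead in container]
    return beads


beads = split_and_pool(900, [["A", "B", "C"], ["D", "E", "F"]])
products = Counter(beads)
print(sorted(products))
```

Each simulated bead carries exactly one two-letter product, which mirrors the statement that each bead has only a single type of molecule on its surface, while the pool as a whole contains the combinations of first and second portions.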

The beads are then recombined into one container and
additional steps such as are conducted to complete the
synthesis of the polymer molecules. In a preferred
embodiment, the beads are tagged with an identifying tag which is
unique to the particular double-stranded oligonucleotide or
probe which is present on each bead. A complete description
of identifier tags for use in synthetic libraries is provided in
co-pending application Ser. No. 08/146,886 (filed Nov. 2,
1993), previously incorporated by reference for all purposes.
Methods of Library Screening

A library prepared according to any of the methods
described above can be used to screen for receptors having
high affinity for either unimolecular, double-stranded
oligonucleotides or conformationally restricted probes. In one
group of embodiments, a solution containing a marked
(labelled) receptor is introduced to the library and incubated
for a suitable period of time. The library is then washed free
of unbound receptor and the probes or double-stranded
oligonucleotides having high affinity for the receptor are
identified by identifying those regions on the surface of the
library where markers are located. Suitable markers include,
but are not limited to, radiolabels, chromophores,
fluorophores, chemiluminescent moieties, and transition metals.
Alternatively, the presence of receptors may be detected
using a variety of other techniques, such as an assay with a
labelled enzyme, antibody, and the like. Other techniques
using various marker systems for detecting bound receptor
will be readily apparent to those skilled in the art.

In a preferred embodiment, a library prepared on a single
solid support (using, for example, the VLSIPS™ technique)
can be exposed to a solution containing marked receptor
such as a marked antibody. The receptor can be marked in
any of a variety of ways, but in one embodiment marking is
effected with a radioactive label. The marked antibody binds
with high affinity to an immobilized antigen previously
localized on the surface. After washing the surface free of
unbound receptor, the surface is placed proximate to x-ray
film or phosphorimagers to identify the antigens that are
recognized by the antibody. Alternatively, a fluorescent
marker may be provided and detection may be by way of a
charge-coupled device (CCD), fluorescence microscopy or
laser scanning.

When autoradiography is the detection method used, the
marker is a radioactive label, such as ³²P. The marker on the
surface is exposed to X-ray film or a phosphorimager, which
is developed and read out on a scanner. An exposure time of
about 1 hour is typical in one embodiment. Fluorescence
detection using a fluorophore label, such as fluorescein,
attached to the receptor will usually require shorter exposure
times.

Quantitative assays for receptor concentrations can also
be performed according to the present invention. In a direct
assay method, the surface containing localized probes,
prepared as described above, is incubated with a solution
containing a marked receptor for a suitable period of time.
The surface is then washed free of unbound receptor. The
amount of marker present at predefined regions of the
surface is then measured and can be related to the amount of
receptor in solution. Methods and conditions for performing
such assays are well-known and are presented in, for
example, L. Hood et al., Immunology, Benjamin/Cummings
(1978), and E. Harlow et al., Antibodies: A Laboratory
Manual, Cold Spring Harbor Laboratory (1988). See also
U.S. Pat. No. 4,376,110 for methods of performing sandwich
assays. The precise conditions for performing these steps
will be apparent to one skilled in the art.
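
One common way of relating a measured marker signal to a receptor concentration is interpolation on a standard curve, and a minimal Python sketch of that approach follows. It is offered only as an illustration under stated assumptions: the function name `interpolate_concentration`, the piecewise-linear model, and every number in `standards` are hypothetical and are not taken from the present disclosure or from the cited references.

```python
def interpolate_concentration(signal, standards):
    """Estimate a concentration from a marker signal.

    standards is a list of (concentration, signal) pairs measured on
    solutions of known receptor concentration. The estimate is a
    piecewise-linear interpolation between neighboring standards.
    """
    points = sorted(standards, key=lambda pair: pair[1])  # order by signal
    for (c0, s0), (c1, s1) in zip(points, points[1:]):
        if s0 <= signal <= s1:
            if s1 == s0:
                return c0
            return c0 + (c1 - c0) * (signal - s0) / (s1 - s0)
    raise ValueError("signal lies outside the range of the standards")


# Hypothetical calibration data: (concentration, measured signal).
standards = [(0.0, 0.0), (1.0, 100.0), (2.0, 180.0), (4.0, 300.0)]
estimate = interpolate_concentration(140.0, standards)
print(estimate)
```

A signal outside the calibrated range raises an error rather than extrapolating, since the relationship between marker and concentration is only assumed to hold between measured standards.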

A competitive assay method for two receptors can also be
employed using the present invention. Methods of
conducting competitive assays are known to those of skill in the art.
One such method involves immobilizing conformationally
restricted probes on predefined regions of a surface as
described above. An unmarked first receptor is then bound
to the probes on the surface having a known specific binding
affinity for the receptors. A solution containing a marked
second receptor is then introduced to the surface and
incubated for a suitable time. The surface is then washed free of
unbound reagents and the amount of marker remaining on
the surface is measured. In another form of competition
assay, marked and unmarked receptors can be exposed to the
surface simultaneously. The amount of marker remaining on
predefined regions of the surface can be related to the
amount of unknown receptor in solution. Yet another form of
competition assay will utilize two receptors having different
labels, for example, two different chromophores.

In other embodiments, in order to detect receptor binding,
the double-stranded oligonucleotides which are formed with
attached probes or with a flexible linking group will be
treated with an intercalating dye, preferably a fluorescent
dye. The library can be scanned to establish a background
fluorescence. After exposure of the library to a receptor
solution, the exposed library will be scanned or illuminated
and examined for those areas in which fluorescence has
changed. Alternatively, the receptor of interest can be
labeled with a fluorescent dye by methods known to those of
skill in the art and incubated with the library of probes. The
library can then be scanned or illuminated, as above, and
examined for areas of fluorescence.

In instances where the libraries are synthesized on beads
in a number of containers, the beads are exposed to a
receptor of interest. In a preferred embodiment, the receptor
is fluorescently or radioactively labelled. Thereafter, one or
more beads are identified that exhibit significant levels of,
for example, fluorescence using one of a variety of
techniques. For example, in one embodiment, mechanical
separation under a microscope is utilized. The identity of the
molecule on the surface of such separated beads is then
identified using, for example, NMR, mass spectrometry,
PCR amplification and sequencing of the associated DNA,
or the like. In another embodiment, automated sorting (i.e.,
fluorescence activated cell sorting) can be used to separate
beads (bearing probes) which bind to receptors from those
which do not bind. Typically the beads will be labeled and
identified by methods disclosed in Needels, et al., Proc.
Natl. Acad. Sci. USA 90:10700-10704 (1993), incorporated
herein by reference.

The assay methods described above for the libraries of the
present invention will have tremendous application in such
endeavors as DNA "footprinting" of proteins which bind
DNA. Currently, DNA footprinting is conducted using
DNase I digestion of double-stranded DNA in the presence
of a putative DNA binding protein. Gel analysis of cut and
protected DNA fragments then provides a "footprint" of
where the protein contacts the DNA. This method is both
labor and time intensive. See, Galas et al., Nucleic Acid Res.
5:3157 (1978). Using the above methods, a "footprint" could
be produced using a single array of unimolecular,
double-stranded oligonucleotides in a fraction of the time of
conventional methods. Typically, the protein will be labeled
with a radioactive or fluorescent species and incubated with
a library of unimolecular, double-stranded DNA.
Phosphorimaging or fluorescence detection will provide a footprint
of those regions on the library where the protein has bound.
Alternatively, unlabeled protein can be used. When
unlabeled protein is used, the double-stranded oligonucleotides
in the library will all be labeled with a marker, typically a
fluorescent marker. Incorporation of a marker into each
member of the library can be carried out by terminating the
oligonucleotide synthesis with a commercially available
fluorescing phosphoramidite nucleotide derivative.
Following incubation with the unlabeled protein, the library will be
treated with DNase I and examined for areas which are
protected from cleavage.

The assay methods described above for the libraries of the
present invention can also be used in reverse drug discovery.
In such an application, a compound having known
pharmacological safety or other desired properties (e.g., aspirin)
could be screened against a variety of double-stranded
oligonucleotides for potential binding. If the compound is
shown to bind to a sequence associated with, for example,
tumor suppression, the compound can be further examined
for efficacy in the related diseases.

In other embodiments, probe arrays comprising β-turn
mimetics can be prepared and assayed for activity against a
particular receptor. β-turn mimetics are compounds having
molecular structures similar to β-turns, which are one of the
three major components in protein molecular architecture.
β-turns are similar in concept to hairpin turns of
oligonucleotide strands, and are often critical recognition features for
various protein-ligand and protein-protein interactions. As a
result, a library of β-turn mimetic probes can provide or
suggest new therapeutic agents having a particular affinity
for a receptor which will correspond to the affinity exhibited
by the β-turn and its receptor.
Bioelectronic Devices and Methods

In another aspect, the present invention provides a method
for the bioelectronic detection of sequence-specific
oligonucleotide hybridization. A general method and device
which is useful in diagnostics in which a biochemical
species is attached to the surface of a sensor is described in
U.S. Pat. No. 4,562,157 (the Lowe patent), incorporated
herein by reference. The present method utilizes arrays of
immobilized oligonucleotides (prepared, for example, using
VLSIPS™ technology) and the known photo-induced
electron transfer which is mediated by a DNA double helix
structure. See, Murphy et al., Science 262:1025-1029
(1993). This method is useful in hybridization-based
diagnostics, as a replacement for fluorescence-based detection
systems. The method of bioelectronic detection also offers
higher resolution and potentially higher sensitivity than
earlier diagnostic methods involving sequencing/detecting
by hybridization. As a result, this method finds applications
in genetic mutation screening and primary sequencing of
oligonucleotides. The method can also be used for
Sequencing By Hybridization (SBH), which is described in
co-pending application Ser. Nos. 08/082,937 (filed Jun. 25,
1993, now abandoned) and 08/168,904 (filed Dec. 15, 1993),
each of which is incorporated herein by reference for all
purposes. This method uses a set of short oligonucleotide
probes of defined sequence to search for complementary
sequences on a longer target strand of DNA. The
hybridization pattern is used to reconstruct the target DNA
sequence. Thus, the hybridization analysis of large numbers
of probes can be used to sequence long stretches of DNA. In
immediate applications of this hybridization methodology, a
small number of probes can be used to interrogate local
DNA sequence.
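
The reconstruction step of Sequencing By Hybridization can be illustrated with a simplified Python sketch. It assumes an idealized, error-free hybridization pattern, a known target length, and a target with no repeated k-mers; the function names `spectrum` and `reconstruct` and the example target are hypothetical and are not drawn from the incorporated applications. For simplicity, the pattern is expressed as the set of target subsequences detected by the probes rather than as the probe sequences themselves.

```python
def spectrum(target, k):
    """Return every length-k substring (k-mer) of target, in order."""
    return [target[i:i + k] for i in range(len(target) - k + 1)]


def reconstruct(hits, length):
    """Return all sequences of the given length whose k-mer set equals hits.

    hits is the set of k-mers that gave a hybridization signal;
    k must be at least 2.
    """
    hits = set(hits)
    k = len(next(iter(hits)))
    results = []

    def extend(seq):
        if len(seq) >= length:
            if len(seq) == length and set(spectrum(seq, k)) == hits:
                results.append(seq)
            return
        for base in "ACGT":
            # The next k-mer must overlap the last k-1 bases of seq.
            if seq[-(k - 1):] + base in hits:
                extend(seq + base)

    for start in sorted(hits):
        extend(start)
    return sorted(results)


target = "ATGCGTAC"               # hypothetical target strand
hits = set(spectrum(target, 4))   # idealized hybridization pattern
candidates = reconstruct(hits, len(target))
print(candidates)
```

The search extends a candidate one base at a time, accepting a base only when the resulting overlap matches a detected k-mer, which is the sense in which overlapping short probes determine a longer sequence.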

In the present inventive method, hybridization is
monitored using bioelectronic detection. In this method, the target
DNA, or first oligonucleotide, is provided with an
electron-donor tag and then incubated with an array of
oligonucleotide probes, each of which bears an electron-acceptor tag
and occupies a known position on the surface of the array.
After hybridization of the first oligonucleotide to the array
has occurred, the hybridized array is illuminated to induce
an electron transfer reaction in the direction of the surface of
the array. The electron transfer reaction is then detected at
the location on the surface where hybridization has taken
place. Typically, each of the oligonucleotide probes in an
array will have an attached electron-acceptor tag located
near the surface of the solid support used in preparation of
the array. In embodiments in which the arrays are prepared
by light-directed methods (i.e., typically 3' to 5' direction),
the electron-acceptor tag will be located near the 3' position.
The electron-acceptor tag can be attached either to the 3'
monomer by methods known to those of skill in the art, or
it can be attached to a spacing group between the 3'
monomer and the solid support. Such a spacing group will
have, in addition to functional groups for attachment to the
solid support and the oligonucleotide, a third functional
group for attachment of the electron-acceptor tag. The target
oligonucleotide will typically have the electron-donor tag
attached at the 3' position. Alternatively, the target
oligonucleotide can be incubated with the array in the absence of
an electron-donor tag. Following incubation, the
electron-donor tag can be added in solution. The electron-donor tag
will then intercalate into those regions where hybridization
has occurred. An electron transfer reaction can then be
detected in those regions having a continuous DNA double
helix.

The electron-donor tag can be any of a variety of
complexes which participate in electron transfer reactions and
which can be attached to an oligonucleotide by a means
which does not interfere with the electron transfer reaction.
In preferred embodiments, the electron-donor tag is a
ruthenium (II) complex, more preferably a ruthenium (II)
(phen')₂(dppz) complex.

The electron-acceptor tag can be any species which, with
the electron-donor tag, will participate in an electron transfer
reaction. An example of an electron-acceptor tag is a
rhodium (III) complex. A preferred electron-acceptor tag is
a rhodium (III) (phi)₂(phen') complex.

In a particularly preferred embodiment, the electron-donor
tag is a ruthenium (II) (phen')₂(dppz) complex and the
electron-acceptor tag is a rhodium (III) (phi)₂(phen')
complex.

In still another aspect, the present invention provides a
device for the bioelectronic detection of sequence-specific
oligonucleotide hybridization. The device will typically
consist of a sensor having a surface to which an array of
oligonucleotides is attached. The oligonucleotides will be
attached in pre-defined areas on the surface of the sensor and
have an electron-acceptor tag attached to each
oligonucleotide. The electron-acceptor tag will be a tag which is
capable of producing an electron transfer signal upon
illumination of a hybridized species, when the complementary
oligonucleotide bears an electron-donating tag. The signal
will be in the direction of the sensor surface and be detected
by the sensor.

In a preferred embodiment, the sensor surface will be a
silicon-based surface which can sense the electronic signal
induced and, if necessary, amplify the signal. The metal
contacts on which the probes will be synthesized can be
treated with an oxygen plasma prior to synthesis of the
probes to enhance the silane adhesion and concentration on
the surface. The surface will further comprise a multi-gated
field effect transistor, with each gate serving as a sensor and
different oligonucleotides attached to each gate. The
oligonucleotides will typically be attached to the metal contacts
on the sensor surface by means of a spacer group.

The spacer group should not be too long, in order to
ensure that the sensing function of the device is easily
activated by the binding interaction and subsequent
illumination of the "tagged" hybridized oligonucleotides.
Preferably, the spacer group is from 3 to 12 atoms in length and
will be as described above for the surface modifying portion
of the spacer group, L¹.

The oligonucleotides which are attached to the spacer
group can be formed by any of the solid phase techniques
which are known to those of skill in the art. Preferably, the
oligonucleotides are formed one base at a time in the
direction of the 3' terminus to the 5' terminus by the
"light-directed" methods described above. The
oligonucleotide can then be modified at the 3' end to attach the
electron-acceptor tag. A number of suitable methods of
attachment are known. For example, modification with the
reagent Aminolink2 (from Applied Biosystems, Inc.)
provides a terminal phosphate moiety which is derivatized with
an aminohexyl phosphate ester. Coupling of a carboxylic
acid, which is present on the electron-acceptor tag, to the
amine can then be carried out using HOBT and DCC.
Alternatively, synthesis of the oligonucleotide can begin
with a suitably derivatized and protected monomer which
can then be deprotected and coupled to the electron-acceptor
tag once the complete oligonucleotide has been synthesized.

The silica surface can also be replaced by silicon nitride
or oxynitride, or by an oxide of another metal, especially
aluminum, titanium (IV) or iron (III). The surface can also
be any other film, membrane, insulator or semiconductor
overlying the sensor which will not interfere with the
detection of electron transfer and to which an
oligonucleotide can be coupled.

Additionally, detection devices other than an FET can be
used. For example, sensors such as bipolar transistors, MOS
transistors and the like are also useful for the detection of
electron transfer signals.
Adhesives 

In still another aspect, the present invention provides an
adhesive comprising a pair of surfaces, each having a
plurality of attached oligonucleotides, wherein the
single-stranded oligonucleotides on one surface are complementary
to the single-stranded oligonucleotides on the other surface.
The strength and position/orientation specificity can be
controlled using a number of factors including the number
and length of oligonucleotides on each surface, the degree of
complementarity, and the spatial arrangement of
complementary oligonucleotides on the surface. For example,
increasing the number and length of the oligonucleotides on each
surface will provide a stronger adhesive. Suitable lengths of
oligonucleotides are typically from about 10 to about 70
nucleotides. Additionally, the surfaces of oligonucleotides
can be prepared such that adhesion occurs in an extremely
orientation-specific manner by a suitable arrangement of
complementary oligonucleotides in a specific pattern. Small
deviations from the optimum spatial arrangement are
energetically unfavorable as many hybridization bonds must be
broken and are not reformed in any other relative
orientation.

The adhesives of the present invention will find use in 
numerous applications. Generally, the adhesives are useful 
for adhering two surfaces to one another. More specifically, 
the adhesives will find application where biological com- 
patibility of the adhesive is desired. An example of a 
biological application involves use in surgical procedures 
where tissues must be held in fixed positions during or 
following the procedure. In this application, the surfaces of 
the adhesive will typically be membranes which are
compatible with the tissues to which they are attached.

A particular advantage of the adhesives of the present
invention is that when they are formed in an orientation
specific manner, the adhesive portions will be "self-finding,"
that is, the system will go to the thermodynamic equilibrium
in which the two sides are matched in the predetermined,
orientation specific manner.

EXAMPLES

Example 1

This example illustrates the general synthesis of an array
of unimolecular, double-stranded oligonucleotides on a solid
support.

Unimolecular double stranded DNA molecules were
synthesized on a solid support using standard light-directed
methods (VLSIPS™ protocols). Two hexaethylene glycol
(PEG) linkers were used to covalently attach the synthesized
oligonucleotides to the derivatized glass surface. Synthesis
of the first (inner) strand proceeded one nucleotide at a time
using repeated cycles of photo-deprotection and chemical
coupling of protected nucleotides. The nucleotides each had
a protecting group on the base portion of the monomer as
well as a photolabile MeNPoc protecting group on the 5'
hydroxyl. Upon completion of the inner strand, another
MeNPoc-protected PEG linker was covalently attached to
the 5' end of the surface-bound oligonucleotide. After
addition of the internal PEG linker, the PEG was photodeprotected,
and the synthesis of the second strand proceeded in the
normal fashion. Following the synthesis cycles, the DNA
bases were deprotected using standard protocols. The
sequence of the second (outer) strand, being complementary
to that of the inner strand, provided molecules with short,
hydrogen bonded, unimolecular double-stranded structure
as a result of the presence of the internal flexible PEG linker.

An array of 16 different molecules was synthesized on a
derivatized glass slide in order to determine whether short,
unimolecular DNA structures could be formed on a surface
and whether they could adopt structures that are recognized
by proteins. Each of the 16 different molecular species
occupies a different physical region on the glass surface so
that there is a one-to-one correspondence between molecular
identity and physical location. The molecules are of the form

S-P-P-C-C-A/T-A/T-A/T-A/T-G-C-P-G-C-A/T-A/T-A/T-
A/T-G-G-F

where S is the solid surface having silyl groups, P is a PEG
linker, A, C, G, and T are the DNA nucleotides, and F is a
fluorescent tag. The DNA sequence is listed from the 3' to
the 5' end (the 3' end of the DNA molecule is attached to the
solid surface via a silyl group and 2 PEG linkers). The
sixteen molecules synthesized on the solid support differed
in the various permutations of A and T in the above formula.

Example 2 

This example illustrates the ability of a library of
surface-bound, unimolecular, double-stranded oligonucleotides to
exist in duplex form and to be recognized and bound by a
protein.

A library of 16 different members was prepared as
described in Example 1. The 16 molecules all have the same
composition (same number of As, Cs, Gs and Ts), but the
order is different. Four of the molecules have an outer strand
that is 100% complementary to the inner strand (these
molecules will be referred to as DS, double-stranded, below).
One of the four DS oligonucleotides has a sequence that is
recognized by the restriction enzyme EcoRI. If the molecule
can loop back and form a DNA duplex, it should be
recognized and cut by the restriction enzyme, thereby
releasing the fluorescent tag. Thus, the action of the enzyme
provided a functional test for DNA structure, and also served
to demonstrate that these structures can be recognized at the
surface by proteins. The remaining 12 molecules had outer
strands that were not complementary to their inner strands
(referred to as SS, single-stranded, below). Of these, three
had an outer strand and three had an inner strand whose
sequence was an EcoRI half-site (the sequence on one
strand was correct for the enzyme, but the other half was
not). The solid support with an array of molecules on the
surface is referred to as a "chip" for the purposes of the
following discussion. The presence of fluorescently labelled
molecules on the chip was detected using confocal
fluorescence microscopy. The action of various enzymes was
determined by monitoring the change in the amount of
fluorescence from the molecules on the chip surface (e.g.,
"reading" the chip) upon treatment with enzymes that can
cut the DNA and release the fluorescent tag at the 5' end.

The three different enzymes used to characterize the
structure of the molecules on the chip were:

1) Mung Bean Nuclease: sequence independent,
single-strand specific DNA endonuclease;

2) DNase I: sequence independent, double-strand
specific endonuclease;

3) EcoRI: restriction endonuclease that recognizes the
sequence (5'-3') GAATTC in double stranded DNA, and cuts
between the G and the first A.

Mung Bean Nuclease and EcoRI were
obtained from New England Biolabs, and DNase I was
obtained from Boehringer Mannheim. All enzymes were
at a concentration of 200 units per mL in the buffer
recommended by the manufacturer. The enzymatic reactions
were performed in a 1 mL flow cell at 22° C. and were
typically allowed to proceed for 90 minutes.
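
The distinction between DS and SS molecules, and the expected EcoRI sensitivity of each, can be sketched in Python as follows. This is an illustration under stated assumptions only: the helper names `reverse_complement` and `classify` are hypothetical, both strands are written 5' to 3', and the 8-nucleotide example sequences are invented for the sketch and are not the sequences actually synthesized on the chip.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
ECORI_SITE = "GAATTC"  # recognition sequence, 5' to 3'


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify(inner, outer):
    """Classify a molecule from its inner and outer strands (5' to 3').

    Returns ("DS", cut) when the outer strand is fully complementary
    to the inner strand, otherwise ("SS", False). cut is True when
    the resulting duplex contains the EcoRI recognition sequence.
    """
    duplex = outer == reverse_complement(inner)
    label = "DS" if duplex else "SS"
    cut_by_ecori = duplex and ECORI_SITE in inner
    return label, cut_by_ecori


# Hypothetical example strands, for illustration only.
print(classify("GGAATTCC", "GGAATTCC"))
print(classify("GGATATCC", "GGATATCC"))
print(classify("GGAATTCC", "GGTTAACC"))
```

The sketch deliberately ignores the bimolecular pairing between neighboring molecules, so an SS molecule bearing an EcoRI half-site is reported here as uncut.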

Upon treatment of the chip with the enzyme EcoRI, the
fluorescence signal in the DS EcoRI region and the 3 SS
regions with the EcoRI half-site on the outer strand was
reduced by about 10% of its initial value. This reduction was
at least 5 times greater than for the other regions of the chip,
indicating that the action of the enzyme is sequence specific
on the chip. It was not possible to determine if the factor is
greater than 5 in these preliminary experiments because of
uncertainty in the constancy of the fluorescence background.
However, because the purpose of these early experiments
was to determine whether unimolecular double-stranded
structures could be formed and whether they could be
specifically recognized by proteins (and not to provide a
quantitative measure of enzyme specificity), qualitative
differences between the different synthesis regions were
sufficient.

The reduction in signal in the 3 SS regions with the EcoRI
half-site on the outer strand indicated either that the enzyme
cuts single-stranded DNA with a particular sequence, or that
these molecules formed a double-stranded structure that was
recognized by the enzyme. The molecules on the chip
surface were at a relatively high density, with an average
spacing of approximately 100 angstroms. Thus, it was
possible for the outer strand of one molecule to form a
double-stranded structure with the outer strand of a
neighboring molecule. In the case of the 3 SS regions with the
EcoRI half-site on the outer strand, such a bimolecular
double-stranded region would have the correct sequence and
structure to be recognized by EcoRI. However, it would
differ from the unimolecular double-stranded molecules in
that the inner strand remains single-stranded and thus
amenable to cleavage by a single-strand specific endonuclease
such as Mung Bean Nuclease. Therefore it was possible to
distinguish unimolecular from bimolecular double-stranded
DNA molecules on the surface by their ability to be cut by
single and double-strand specific endonucleases.

In order to remove all molecules that have single-stranded
structures and to identify unimolecular double-stranded
molecules, the chip was first exhaustively treated with Mung
Bean Nuclease. The reduction in the fluorescence signal was
greater by about a factor of 2 for the SS regions of the chip,
including those with the EcoRI half-site on the outer strand
that were cleaved by EcoRI, than for the 4 DS regions.
Following Mung Bean Nuclease treatment, the chip was
treated with either DNase I (which cuts all remaining
double-stranded molecules) or EcoRI (which should cut
only the remaining double-stranded molecules with the
correct sequence). Upon treatment with DNase I, the
fluorescence signal in the 4 DS regions was reduced by at least
5-fold more than the signal in the SS regions. Upon EcoRI
treatment, the signal in the single DS region with the correct
EcoRI sequence was reduced by at least a factor of 3 more
than the signal in any other region on the chip. Taken
together, these results indicated that the surface-bound
molecules synthesized with two complementary strands
separated by a flexible PEG linker form intramolecular
double-stranded structures that were resistant to a single-strand
specific endonuclease and were recognized by both a
double-strand specific endonuclease and a sequence-specific
restriction enzyme.

What is claimed is: 

1. A synthetic unimolecular, double-stranded oligonucleotide
library comprising a plurality of different members,
each member having the formula:

Y-L¹-X¹-L²-X²

wherein,
Y is a solid support;
X¹ and X² are a pair of complementary oligonucleotides;
L¹ is a spacer;
L² is a linking group having sufficient length such that X¹
and X² form a double-stranded oligonucleotide.

2. A library in accordance with claim 1, wherein L² is a
polyethylene glycol group.

3. A library in accordance with claim 1, wherein X¹ and
X² are complementary oligonucleotides each comprising of
from 6 to 30 nucleic acid monomers.

4. A library in accordance with claim 1, wherein said solid
support is a silica support and L¹ comprises an
aminoalkylsilane and from 1 to 4 hexaethyleneglycols.

5. A library in accordance with claim 1, wherein said solid
support is a silica support, L¹ comprises an aminoalkylsilane
and from 1 to 4 hexaethyleneglycols, L² is a
polyethyleneglycol group and X¹ and X² are complementary
oligonucleotides each comprising of from 6 to 30 nucleic acid
monomers.

6. A synthetic unimolecular, double-stranded oligonucleotide
library of claim 1, wherein a portion of said
double-stranded oligonucleotides formed by X¹ and X² further
comprise a loop.
BACKGROUND OF THE INVENTION

The present invention relates to the field of polymer
synthesis. More specifically, the invention provides a reactor
system, a masking strategy, photoremovable protective
groups, data collection and processing techniques, and
applications for light directed synthesis of diverse polymer
sequences on substrates.

SUMMARY OF THE INVENTION 



Methods, apparatus, and compositions for synthesis and
use of diverse polymer sequences on a substrate are
disclosed, as well as applications thereof.

According to one aspect of the invention, an improved
reactor system for synthesis of diverse polymer sequences
on a substrate is provided. According to this embodiment the
invention provides for a reactor for contacting reaction fluids
to a substrate; a system for delivering selected reaction fluids
to the reactor; a translation stage for moving a mask or
substrate from at least a first relative location relative to a
second relative location; a light for illuminating the substrate
through a mask at selected times; and an appropriately
programmed digital computer for selectively directing a
flow of fluids from the reactor system, selectively activating
the translation stage, and selectively illuminating the
substrate so as to form a plurality of diverse polymer sequences
on the substrate at predetermined locations.

The invention also provides a technique for selection of
linker molecules in a very large scale immobilized polymer
synthesis (VLSIPS™) method. According to this aspect of
the invention, the invention provides a method of screening
a plurality of linker polymers for use in binding affinity
studies. The invention includes the steps of forming a
plurality of linker polymers on a substrate in selected
regions, the linker polymers formed by the steps of
recursively: on a surface of a substrate, irradiating a portion of the
selected regions to remove a protective group, and contacting
the surface with a monomer; contacting the plurality of
linker polymers with a ligand; and contacting the ligand with
a labeled receptor.

According to another aspect of the invention, improved
photoremovable protective groups are provided. According
to this aspect of the invention a compound having the
formula:

[Structural formula not recoverable from the source; the legible
fragments are an OMe substituent and a group X.]



wherein n=0 or 1; Y is selected from the group consisting of
an oxygen of the carboxyl group of a natural or unnatural
amino acid, an amino group of a natural or unnatural amino
acid, or the C-5' oxygen group of a natural or unnatural
deoxyribonucleic or ribonucleic acid; R¹ and R² independently
are a hydrogen atom, a lower alkyl, aryl, benzyl,
halogen, hydroxyl, alkoxyl, thiol, thioether, amino, nitro,
carboxyl, formate, formamido, sulfido, or phosphido group;
and R³ is an alkoxy, alkyl, aryl, hydrogen, or alkenyl group
is provided.

The invention also provides improved masking techniques
for the VLSIPS™ methodology. According to one
aspect of the masking technique, the invention provides an
ordered method for forming a plurality of polymer
sequences by sequential addition of reagents comprising the
step of serially protecting and deprotecting portions of the
plurality of polymer sequences for addition of other portions
of the polymer sequences using a binary synthesis strategy.

Improved data collection equipment and techniques are
also provided. According to one embodiment, the
instrumentation provides a system for determining affinity of a
receptor to a ligand comprising: means for applying light to
a surface of a substrate, the substrate comprising a plurality
of ligands at predetermined locations, the means for providing
simultaneous illumination at a plurality of the predetermined
locations; and an array of detectors for detecting light
fluoresced at the plurality of predetermined locations. The
invention further provides for improved data analysis
techniques including the steps of exposing fluorescently labelled
receptors to a substrate, the substrate comprising a plurality
of ligands in regions at known locations; at a plurality of
data collection points within each of the regions, determining
an amount of light fluoresced from the data collection
points; removing the data collection points deviating from a
predetermined statistical distribution; and determining a
relative binding affinity of the receptor to remaining data
collection points.
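
The data analysis steps above (several intensity readings per region, removal of readings that deviate from a chosen distribution, then comparison of regions) can be sketched as follows. This is a minimal illustration and not the method of the specification: the function names, the median-absolute-deviation cutoff, the threshold `k`, and the sample intensities are all assumptions chosen for the example.

```python
from statistics import mean, median

def trim_outliers(points, k=3.0):
    """Drop readings more than k median-absolute-deviations from the median.

    The MAD rule and the value of k are illustrative assumptions; the text
    only says that deviating data collection points are removed.
    """
    med = median(points)
    mad = median(abs(p - med) for p in points)
    if mad == 0:
        return list(points)  # no spread estimate available, keep everything
    return [p for p in points if abs(p - med) <= k * mad]

def relative_affinity(regions, k=3.0):
    """Mean trimmed intensity per region, normalized to the brightest region."""
    trimmed = {name: mean(trim_outliers(pts, k)) for name, pts in regions.items()}
    top = max(trimmed.values())
    return {name: value / top for name, value in trimmed.items()}

# Hypothetical photon counts at five collection points in two regions.
regions = {
    "region_1": [200, 205, 198, 202, 900],  # 900 mimics a dust speck
    "region_2": [15, 17, 16, 14, 16],
}
print(relative_affinity(regions))
```

The normalization to the brightest region is one possible convention for "relative" affinity; any fixed reference region would serve equally well.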
Protected amino acid N-carboxy anhydrides for use in
polymer synthesis are also disclosed. According to this
aspect the invention provides a compound having the
formula:

[Structural formula of the protected N-carboxy anhydride not
recoverable from the source.]

where R is a side chain of a natural or unnatural amino acid
and X is a photoremovable protecting group.

A further understanding of the nature and advantages of
the inventions herein may be realized by reference to the
remaining portions of the specification and the attached
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically illustrates light-directed
spatially-addressable parallel chemical synthesis;

FIG. 2 schematically illustrates one example of
light-directed peptide synthesis;

FIG. 3 is a three-dimensional representation of a portion
of the checkerboard array of YGGFL and PGGFL;

FIG. 4 schematically illustrates an automated system for
synthesizing diverse polymer sequences;

FIGS. 5a and 5b illustrate operation of a program for
polymer synthesis;

FIGS. 6a and 6b are a schematic illustration of a "pure"
binary masking strategy;

FIGS. 7a and 7b are a schematic illustration of a gray code
binary masking strategy;

FIGS. 8a and 8b are a schematic illustration of a modified
gray code binary masking strategy;

FIG. 9a schematically illustrates a masking scheme for a
four step synthesis;

FIG. 9b schematically illustrates synthesis of all 400
peptide dimers;

FIG. 10 is a coordinate map for the ten-step binary
synthesis;

FIG. 11 schematically illustrates a data collection system;

FIG. 12 is a block diagram illustrating the architecture of
the data collection system;

FIG. 13 is a flow chart illustrating operation of software
for the data collection/analysis system; and

FIG. 14 illustrates a three-dimensional plot of intensity
versus position for light directed synthesis of a dinucleotide.

DESCRIPTION OF THE PREFERRED
EMBODIMENTS

CONTENTS

I. Definitions
II. General
  A. Deprotection and Addition
    1. Example
    2. Example
  B. Antibody recognition
    1. Example
III. Synthesis
  A. Reactor System
  B. Binary Synthesis Strategy
    1. Example
    2. Example
    3. Example
    4. Example
    5. Example
    6. Example
  C. Linker Selection
  D. Protecting Groups
    1. Use of Photoremovable Groups During Solid-Phase
       Synthesis of Peptides
    2. Use of Photoremovable Groups During Solid-Phase
       Synthesis of Oligonucleotides
  E. Amino Acid N-Carboxy Anhydrides Protected with a
     Photoremovable Group
IV. Data Collection
  A. Data Collection System
  B. Data Analysis
V. Other Representative Applications
  A. Oligonucleotide Synthesis
VI. Conclusion

I. DEFINITIONS

Certain terms used herein are intended to have the
following general definitions:

1. Complementary:

Refers to the topological compatibility or matching
together of interacting surfaces of a ligand molecule and its
receptor. Thus, the receptor and its ligand can be described
as complementary, and furthermore, the contact surface
characteristics are complementary to each other.

2. Epitope:

The portion of an antigen molecule which is delineated by
the area of interaction with the subclass of receptors known
as antibodies.

3. Ligand:

A ligand is a molecule that is recognized by a particular
receptor. Examples of ligands that can be investigated by
this invention include, but are not restricted to, agonists and
antagonists for cell membrane receptors, toxins and venoms,
viral epitopes, hormones, hormone receptors, peptides,
enzymes, enzyme substrates, cofactors, drugs (e.g. opiates,
steroids, etc.), lectins, sugars, oligonucleotides, nucleic
acids, oligosaccharides, proteins, and monoclonal antibodies.

4. Monomer:

A member of the set of small molecules which can be
joined together to form a polymer. The set of monomers
includes but is not restricted to, for example, the set of
common L-amino acids, the set of D-amino acids, the set of
synthetic amino acids, the set of nucleotides and the set of
pentoses and hexoses. As used herein, monomers refers to
any member of a basis set for synthesis of a polymer. For
example, dimers of the 20 naturally occurring L-amino acids
form a basis set of 400 monomers for synthesis of
polypeptides. Different basis sets of monomers may be used at
successive steps in the synthesis of a polymer. Furthermore,
each of the sets may include protected members which are
modified after synthesis.

5. Peptide:

A polymer in which the monomers are alpha amino acids
and which are joined together through amide bonds and
alternatively referred to as a polypeptide. In the context of
this specification it should be appreciated that the amino
acids may be the L-optical isomer or the D-optical isomer.
Peptides are often two or more amino acid monomers long,
and often more than 20 amino acid monomers long.
Standard abbreviations for amino acids are used (e.g., P for
proline). These abbreviations are included in Stryer,
Biochemistry, Third Ed., 1988, which is incorporated herein
by reference for all purposes.

6. Radiation:

Energy which may be selectively applied including
energy having a wavelength of between 10⁻¹⁴ and 10⁴
meters including, for example, electron beam radiation,
gamma radiation, x-ray radiation, ultraviolet radiation,
visible light, infrared radiation, microwave radiation, and radio
waves. "Irradiation" refers to the application of radiation to
a surface.

7. Receptor:

A molecule that has an affinity for a given ligand.
Receptors may be naturally-occurring or manmade molecules.
Also, they can be employed in their unaltered state or as
aggregates with other species. Receptors may be attached,
covalently or noncovalently, to a binding member, either
directly or via a specific binding substance. Examples of
receptors which can be employed by this invention include,
but are not restricted to, antibodies, cell membrane
receptors, monoclonal antibodies and antisera reactive with
specific antigenic determinants (such as on viruses, cells or
other materials), drugs, polynucleotides, nucleic acids,
peptides, cofactors, lectins, sugars, polysaccharides, cells,
cellular membranes, and organelles. Receptors are sometimes
referred to in the art as anti-ligands. As the term
receptors is used herein, no difference in meaning is
intended. A "Ligand Receptor Pair" is formed when two
macromolecules have combined through molecular recognition
to form a complex. Other examples of receptors which
can be investigated by this invention include but are not
restricted to:

a) Microorganism receptors:
Determination of ligands which bind to receptors, such
as specific transport proteins or enzymes essential to
survival of microorganisms, is useful in developing
a new class of antibiotics. Of particular value would
be antibiotics against opportunistic fungi, protozoa,
and those bacteria resistant to the antibiotics in
current use.

b) Enzymes:
For instance, one type of receptor is the binding site of
enzymes such as the enzymes responsible for cleaving
neurotransmitters; determination of ligands
which bind to certain receptors to modulate the
action of the enzymes which cleave the different
neurotransmitters is useful in the development of
drugs which can be used in the treatment of disorders
of neurotransmission.

c) Antibodies:
For instance, the invention may be useful in investigating
the ligand-binding site on the antibody molecule
which combines with the epitope of an antigen
of interest; determining a sequence that mimics an
antigenic epitope may lead to the development of
vaccines of which the immunogen is based on one or
more of such sequences or lead to the development
of related diagnostic agents or compounds useful in
therapeutic treatments such as for auto-immune diseases
(e.g., by blocking the binding of the "self"
antibodies).

d) Nucleic Acids:
Sequences of nucleic acids may be synthesized to
establish DNA or RNA binding sequences.

e) Catalytic Polypeptides:
Polymers, preferably polypeptides, which are capable
of promoting a chemical reaction involving the
conversion of one or more reactants to one or more
products. Such polypeptides generally include a
binding site specific for at least one reactant or
reaction intermediate and an active functionality
proximate to the binding site, which functionality is
capable of chemically modifying the bound reactant.
Catalytic polypeptides are described in, for example,
U.S. Pat. No. 5,215,899, which is incorporated
herein by reference for all purposes.

f) Hormone receptors:
Examples of hormone receptors include, e.g., the
receptors for insulin and growth hormone. Determination
of the ligands which bind with high affinity to
a receptor is useful in the development of, for
example, an oral replacement of the daily injections
which diabetics must take to relieve the symptoms of
diabetes, and in the other case, a replacement for the
scarce human growth hormone which can only be
obtained from cadavers or by recombinant DNA
technology. Other examples are the vasoconstrictive
hormone receptors; determination of those ligands
which bind to a receptor may lead to the development
of drugs to control blood pressure.

g) Opiate receptors:
Determination of ligands which bind to the opiate
receptors in the brain is useful in the development of
less-addictive replacements for morphine and related
drugs.

8. Substrate:

A material having a rigid or semi-rigid surface. In many
embodiments, at least one surface of the substrate will be
substantially flat, although in some embodiments it may be
desirable to physically separate synthesis regions for different
polymers with, for example, wells, raised regions, etched
trenches, or the like. According to other embodiments, small
beads may be provided on the surface which may be released
upon completion of the synthesis.

9. Protective Group:

A material which is chemically bound to a monomer unit
and which may be removed upon selective exposure to an
activator such as electromagnetic radiation. Examples of
protective groups with utility herein include those comprising
nitropiperonyl, pyrenylmethoxy-carbonyl, nitroveratryl,
nitrobenzyl, dimethyl dimethoxybenzyl, 5-bromo-7-nitroindolinyl,
o-hydroxy-α-methyl cinnamoyl, and
2-oxymethylene anthraquinone.

10. Predefined Region:

A predefined region is a localized area on a surface which
is, was, or is intended to be activated for formation of a
polymer. The predefined region may have any convenient
shape, e.g., circular, rectangular, elliptical, wedge-shaped,
etc. For the sake of brevity herein, "predefined regions" are
sometimes referred to simply as "regions."

11. Substantially Pure:

A polymer is considered to be "substantially pure" within
a predefined region of a substrate when it exhibits
characteristics that distinguish it from other predefined regions.
Typically, purity will be measured in terms of biological
activity or function as a result of uniform sequence. Such
characteristics will typically be measured by way of binding
with a selected ligand or receptor.

12. Activator refers to an energy source adapted to render a
group active and which is directed from a source to a
predefined location on a substrate. A primary illustration of
an activator is light. Other examples of activators include ion
beams, electric fields, magnetic fields, electron beams, x-ray,
and the like.

13. Binary Synthesis Strategy refers to an ordered strategy
for parallel synthesis of diverse polymer sequences by
sequential addition of reagents which may be represented by
a reactant matrix, and a switch matrix, the product of which
is a product matrix. A reactant matrix is a 1×n matrix of the
building blocks to be added. The elements of the switch
matrix are binary numbers. In preferred embodiments, a
binary strategy is one in which at least two successive steps
illuminate half of a region of interest on the substrate. In
most preferred embodiments, binary synthesis refers to a
synthesis strategy which also factors a previous addition
step. For example, a strategy in which a switch matrix for a
masking strategy halves regions that were previously
illuminated, illuminating about half of the previously
illuminated region and protecting the remaining half (while also
protecting about half of previously protected regions and
illuminating about half of previously protected regions). It
will be recognized that binary rounds may be interspersed
with non-binary rounds and that only a portion of a substrate
may be subjected to a binary scheme, but will still be
considered to be a binary masking scheme within the
definition herein. A binary "masking" strategy is a binary
synthesis which uses light to remove protective groups from
materials for addition of other materials such as amino acids.
In preferred embodiments, selected columns of the switch
matrix are arranged in order of increasing binary numbers in
the columns of the switch matrix.

14. Linker refers to a molecule or group of molecules
attached to a substrate and spacing a synthesized polymer
from the substrate for exposure/binding to a receptor.

II. General

The present invention provides synthetic strategies and
devices for the creation of large scale chemical diversity.
Solid-phase chemistry, photolabile protecting groups, and
photolithography are brought together to achieve
light-directed spatially-addressable parallel chemical synthesis in
preferred embodiments.

The invention is described herein for purposes of
illustration primarily with regard to the preparation of peptides
and nucleotides, but could readily be applied in the
preparation of other polymers. Such polymers include, for
example, both linear and cyclic polymers of nucleic acids,
polysaccharides, phospholipids, and peptides having either
α-, β-, or ω-amino acids, heteropolymers in which a known
drug is covalently bound to any of the above, polyurethanes,
polyesters, polycarbonates, polyureas, polyamides,
polyethyleneimines, polyarylene sulfides, polysiloxanes,
polyimides, polyacetates, or other polymers which will be
apparent upon review of this disclosure. It will be
recognized further, that illustrations herein are primarily with
reference to C- to N-terminal synthesis, but the invention
could readily be applied to N- to C-terminal synthesis
without departing from the scope of the invention.

A. Deprotection and Addition

The present invention uses a masked light source or other
activator to direct the simultaneous synthesis of many
different chemical compounds. FIG. 1 is a flow chart illustrating
the process of forming chemical compounds according
to one embodiment of the invention. Synthesis occurs on a
solid support 2. A pattern of illumination through a mask 4a
using a light source 6 determines which regions of the
support are activated for chemical coupling. In one preferred
embodiment activation is accomplished by using light to
remove photolabile protecting groups from selected areas of
the substrate.

After deprotection, a first of a set of building blocks
(indicated by "A" in FIG. 1), each bearing a photolabile
protecting group (indicated by "X") is exposed to the surface
of the substrate and it reacts with regions that were
addressed by light in the preceding step. The substrate is
then illuminated through a second mask 4b, which activates
another region for reaction with a second protected building
block "B". The pattern of masks used in these illuminations
and the sequence of reactants define the ultimate products
and their locations, resulting in diverse sequences at
predefined locations, as shown with the sequences ACEG and
BDFH in the lower portion of FIG. 1. Preferred embodiments
of the invention take advantage of combinatorial
masking strategies to form a large number of compounds in
a small number of chemical steps.

A high degree of miniaturization is possible because the
density of compounds is determined largely with regard to
spatial addressability of the activator, in one case the
diffraction of light. Each compound is physically accessible
and its position is precisely known. Hence, the array is
spatially-addressable and its interactions with other
molecules can be assessed.

In a particular embodiment shown in FIG. 1, the substrate
contains amino groups that are blocked with a photolabile
protecting group. Amino acid sequences are made accessible
for coupling to a receptor by removal of the photoprotective
groups.

When a polymer sequence to be synthesized is, for
example, a polypeptide, amino groups at the ends of linkers
attached to a glass substrate are derivatized with
nitroveratryloxycarbonyl (NVOC), a photoremovable protecting
group. The linker molecules may be, for example, aryl
acetylene, ethylene glycol oligomers containing from 2-10
monomers, diamines, diacids, amino acids, or combinations
thereof. Photodeprotection is effected by illumination of the
substrate through, for example, a mask wherein the pattern
has transparent regions with dimensions of, for example,
less than 1 cm², 10⁻¹ cm², 10⁻² cm², 10⁻³ cm², 10⁻⁴ cm²,
10⁻⁵ cm², 10⁻⁶ cm², 10⁻⁷ cm², 10⁻⁸ cm², or 10⁻¹⁰ cm². In
a preferred embodiment, the regions are between about
10×10 µm and 300×500 µm. According to some
embodiments, the masks are arranged to produce a
checkerboard array of polymers, although any one of a variety of
geometric configurations may be utilized.

1. Example

In one example of the invention, free amino groups were
fluorescently labelled by treatment of the entire substrate
surface with fluorescein isothiocyanate (FITC) after
photodeprotection. Glass microscope slides were cleaned,
aminated by treatment with 0.1% aminopropyltriethoxysilane in
95% ethanol, and incubated at 110° C. for 20 min. The
aminated surface of the slide was then exposed to a 30 mM
solution of the N-hydroxysuccinimide ester of NVOC-GABA
(nitroveratryloxycarbonyl-γ-amino butyric acid) in
DMF. The NVOC protecting group was photolytically
removed by imaging the 365 nm output from a Hg arc lamp
through a chrome on glass 100 µm checkerboard mask onto
the substrate for 20 min at a power density of 12 mW/cm².
The exposed surface was then treated with 1 mM FITC in
DMF. The substrate surface was scanned in an
epi-fluorescence microscope (Zeiss Axioskop 20) using 488 nm
excitation from an argon ion laser (Spectra-Physics model
2025). The fluorescence emission above 520 nm was
detected by a cooled photomultiplier (Hamamatsu 943-02)
operated in a photon counting mode. Fluorescence intensity
was translated into a color display with red in the highest
intensity and black in the lowest intensity areas. The
presence of a high-contrast fluorescent checkerboard pattern of
100×100 µm elements revealed that free amino groups were
generated in specific regions by spatially localized
photodeprotection.
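
The mask-and-couple cycle of this section, read together with the reactant-matrix and switch-matrix language of definition 13, can be modelled in a few lines. The sketch below is illustrative only: the building-block labels, the function names, and the choice to enumerate every binary column are assumptions, and real masks act on spatial regions of a substrate rather than on list indices.

```python
from itertools import product

def switch_matrix(n_steps):
    """Build a binary switch matrix with one column per site.

    Columns are all n_steps-bit patterns in increasing binary order;
    row i tells which sites are illuminated (1) at coupling step i.
    """
    columns = list(product((0, 1), repeat=n_steps))
    return [[col[i] for col in columns] for i in range(n_steps)]

def synthesize(reactants, switches):
    """Append reactants[i] to every site whose switch is 1 at step i."""
    sites = ["" for _ in switches[0]]
    for block, row in zip(reactants, switches):
        for index, lit in enumerate(row):
            if lit:
                sites[index] += block
    return sites

reactants = ["A", "B", "C", "D"]          # hypothetical building blocks
switches = switch_matrix(len(reactants))
products = synthesize(reactants, switches)
print(len(products), products)
```

Each row of this switch matrix illuminates exactly half of the sites, and four coupling steps yield 2⁴ = 16 distinct products (counting the site that is never illuminated), which is the sense in which a small number of chemical steps produces a large number of compounds.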
2. EXAMPLE

FIG. 2 is a flow chart illustrating another example of the
invention. Carboxy-activated NVOC-leucine was allowed to
react with an aminated substrate. The carboxy activated
HOBT ester of leucine and other amino acids used in this
synthesis was formed by mixing 0.25 mmol of the NVOC
amino protected amino acid with 37 mg HOBT
(1-hydroxybenzotriazole), 111 mg BOP
(benzotriazolyl-n-oxy-tris(dimethylamino)phosphonium
hexafluorophosphate) and 86 µl DIEA (diisopropylethylamine) in
2.5 ml DMF. The NVOC protecting group was removed by
uniform illumination. Carboxy-activated NVOC-phenylalanine
was coupled to the exposed amino groups for
2 hours at room temperature, and then washed with DMF
and methylene chloride. Two unmasked cycles of
photodeprotection and coupling with carboxy-activated
NVOC-glycine were carried out. The surface was then illuminated
through a chrome on glass 50 µm checkerboard pattern mask.
Carboxy-activated Nα-tBOC-O-tButyl-L-tyrosine was then
added. The entire surface was uniformly illuminated to
photolyze the remaining NVOC groups. Finally,
carboxy-activated NVOC-L-proline was added, the NVOC group
was removed by illumination, and the t-BOC and t-butyl
protecting groups were removed with TFA. After removal of
the protecting groups, the surface consisted of a 50 µm
checkerboard array of Tyr-Gly-Gly-Phe-Leu (YGGFL)
(Seq. ID No:1) and Pro-Gly-Gly-Phe-Leu (PGGFL) (Seq. ID
No:2).
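
The coupling order in this example can be traced with a small simulation. This is a sketch under stated assumptions: the 4×4 grid, the helper names, and the one-letter bookkeeping are illustrative, every coupling is assumed to go to completion, and light is assumed to remove only NVOC caps. Because synthesis proceeds from the C-terminus, each new residue is prepended to the growing sequence.

```python
ROWS = COLS = 4
ALL = [[True] * COLS for _ in range(ROWS)]                      # unmasked exposure
CHECKER = [[(r + c) % 2 == 0 for c in range(COLS)] for r in range(ROWS)]

def illuminate(caps, mask):
    """Remove photolabile NVOC caps wherever the mask is transparent."""
    return [[None if (lit and cap == "NVOC") else cap
             for cap, lit in zip(cap_row, mask_row)]
            for cap_row, mask_row in zip(caps, mask)]

def couple(seqs, caps, residue, new_cap):
    """Add residue at the N-terminus of every uncapped site, then cap it."""
    new_seqs = [[residue + s if c is None else s for s, c in zip(s_row, c_row)]
                for s_row, c_row in zip(seqs, caps)]
    new_caps = [[new_cap if c is None else c for c in c_row] for c_row in caps]
    return new_seqs, new_caps

seqs = [[""] * COLS for _ in range(ROWS)]
caps = [[None] * COLS for _ in range(ROWS)]   # aminated surface, free amines

seqs, caps = couple(seqs, caps, "L", "NVOC")
for residue in ("F", "G", "G"):
    caps = illuminate(caps, ALL)
    seqs, caps = couple(seqs, caps, residue, "NVOC")
caps = illuminate(caps, CHECKER)              # checkerboard mask
seqs, caps = couple(seqs, caps, "Y", "tBOC")  # tBOC is not removed by light
caps = illuminate(caps, ALL)                  # frees the remaining NVOC sites
seqs, caps = couple(seqs, caps, "P", "NVOC")

for row in seqs:
    print(" ".join(row))
```

The final NVOC photolysis and the TFA removal of the t-BOC and t-butyl groups are omitted because they change only the caps, not the sequences.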

B. Antibody Recognition 

In one preferred embodiment the substrate is used to
determine which of a plurality of amino acid sequences is
recognized by an antibody of interest.

1. EXAMPLE

In one example, the array of pentapeptides in the example
illustrated in FIG. 2 was probed with a mouse monoclonal
antibody directed against β-endorphin. This antibody (called
3E7) is known to bind YGGFL and YGGFM (Seq. ID
No:21) with nanomolar affinity and is discussed in Meo et
al., Proc. Natl. Acad. Sci. USA (1983) 80:4084, which is
incorporated by reference herein for all purposes. This
antibody requires the amino terminal tyrosine for high
affinity binding. The array of peptides formed as described
in FIG. 2 was incubated with a 2 µg/ml mouse monoclonal
antibody (3E7) known to recognize YGGFL. 3E7 does not
bind PGGFL. A second incubation with fluoresceinated goat
anti-mouse antibody labeled the regions that bound 3E7. The
surface was scanned with an epi-fluorescence microscope.
The results showed alternating bright and dark 50 µm
squares indicating that YGGFL and PGGFL were synthesized
in a geometric array determined by the mask. A high
contrast (>12:1 intensity ratio) fluorescence checkerboard
image shows that (a) YGGFL and PGGFL were synthesized
in alternate 50 µm squares, (b) YGGFL attached to the
surface is accessible for binding to antibody 3E7, and (c)
antibody 3E7 does not bind to PGGFL.

A three-dimensional representation of the fluorescence
intensity data in a portion of the checkerboard is shown in FIG.
3. This figure shows that the border between synthesis sites
is sharp. The height of each spike in this display is linearly
proportional to the integrated fluorescence intensity in a 2.5
µm pixel. The transition between PGGFL and YGGFL
occurs within two spikes (5 µm). There is little variation in
the fluorescence intensity of different YGGFL squares. The
mean intensity of sixteen YGGFL synthesis sites was 2.03×10⁵
counts and the standard deviation was 9.6×10³ counts.
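
The reported mean and standard deviation give a quick measure of site-to-site uniformity. The short calculation below is a worked example that uses only those two values, read as 2.03×10⁵ and 9.6×10³ counts; the variable names are illustrative.

```python
mean_counts = 2.03e5   # mean intensity of the sixteen YGGFL sites
sd_counts = 9.6e3      # standard deviation across those sites

# Coefficient of variation: spread relative to the mean.
cv = sd_counts / mean_counts
print(f"coefficient of variation = {cv:.1%}")   # prints 4.7%
```

A spread of roughly 5% of the mean is consistent with the statement that there is little variation among the YGGFL squares.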

III. Synthesis

A. Reactor System

FIG. 4 schematically illustrates a device used to synthesize
diverse polymer sequences on a substrate. The
substrate, the area of synthesis, and the area for synthesis of
each individual polymer could be of any size or shape. For
example, squares, ellipsoids, rectangles, triangles, circles, or
portions thereof, along with irregular geometric shapes may
be utilized. Duplicate synthesis areas may also be applied to
a single substrate for purposes of redundancy.

In one embodiment, the predefined regions on the
substrate will have a surface area of between about 1 cm² and
10⁻¹⁰ cm². In some embodiments the regions have areas of
less than about 10⁻¹ cm², 10⁻² cm², 10⁻³ cm², 10⁻⁴ cm², 10⁻⁵
cm², 10⁻⁶ cm², 10⁻⁷ cm², 10⁻⁸ cm², 10⁻⁹ cm² or 10⁻¹⁰ cm².
In a preferred embodiment, the regions are between about
10×10 µm

In some embodiments a single substrate supports more
than about 10 different monomer sequences and preferably
more than about 100 different monomer sequences, although
in some embodiments more than about 10³, 10⁴, 10⁵, 10⁶,
10⁷, or 10⁸ different sequences are provided on a substrate.

Of course, within a region of the substrate in which a
monomer sequence is synthesized, it is preferred that the
monomer sequence be substantially pure. In some
embodiments, regions of the substrate contain polymer
sequences which are at least about 1%, 5%, 10%, 15%, 20%,
25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%, 80%, 90%,
95%, 96%, 97%, 98%, or 99% pure. The device includes an
automated peptide synthesizer 401. The automated peptide
synthesizer is a device which flows selected reagents
through a flow cell 402 under the direction of a computer
404. In a preferred embodiment the synthesizer is an ABI
Peptide Synthesizer, model no. 431A. The computer may be
selected from a wide variety of computers or discrete logic
including, for example, an IBM PC-AT or similar computer
linked with appropriate internal control systems in the
peptide synthesizer. The PC is provided with signals from
the board computer indicative of, for example, the end of a
coupling cycle.

Substrate 406 is mounted on the flow cell, forming a
cavity between the substrate and the flow cell. Selected
reagents flow through this cavity from the peptide
synthesizer at selected times, forming an array of peptides on the
face of the substrate in the cavity. Mounted above the
substrate, and preferably in contact with the substrate, is a
mask 408. Mask 408 is transparent in selected regions to a
selected wavelength of light and is opaque in other regions
to the selected wavelength of light. The mask is illuminated
with a light source 410 such as a UV light source. In one
specific embodiment the light source 410 is a model no.
82420 made by Oriel. The mask is held and translated by an
x-y-z translation stage 412 such as an x-y translation stage
made by Newport Corp. The computer coordinates action of
the peptide synthesizer, x-y translation stage, and light
source. Of course, the invention may be used in some
embodiments with translation of the substrate instead of the
mask.

In operation, the substrate is mounted on the reactor
cavity. The slide, with its surface protected by a suitable
photoremovable protective group, is exposed to light at
selected locations by positioning the mask and illuminating
the light source for a desired period of time (such as, for
example, 1 sec to 60 min in the case of peptide synthesis).
A selected peptide or other monomer/polymer is pumped
through the reactor cavity by the peptide synthesizer for
binding at the selected locations on the substrate. After a
selected reaction time (such as about 1 sec to 300 min in the
case of peptide reactions) the monomer is washed from
the system, the mask is appropriately repositioned or
replaced, and the cycle is repeated. In most embodiments of
the invention, reactions may be conducted at or near ambient
temperature.

FIGS. 5a and 5b are flow charts of the software used in
operation of the reactor system. At step 502 the peptide
synthesis software is initialized. At step 504 the system
calibrates positioners on the x-y translation stage and begins
a main loop. At step 506 the system determines which, if
any, of the function keys on the computer have been pressed.
If F1 has been pressed, the system prompts the user for input
of a desired synthesis process. If the user enters F2, the
system allows a user to edit a file for a synthesis process at
step 510. If the user enters F3 the system loads a process
from a disk at step 512. If the user enters F4 the system saves
an entered or edited process to disk at step 514. If the user
selects F5 the current process is displayed at step 516 while
selection of F6 starts the main portion of the program, i.e.,
the actual synthesis according to the selected process. If the
user selects F7 the system displays the location of the
synthesized peptides, while pressing F10 returns the user to
the disk operating system.
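
By way of illustration only, the function-key menu of FIG. 5a
might be sketched as follows in Python. This is not the Turbo
C++ program of Appendix A; the names make_menu and main_loop
are hypothetical, step 518 is taken from the description of
FIG. 5b, and step numbers the text does not state are left as
None.

```python
def make_menu():
    """Map each function key to a (step number, action) pair.

    Step numbers come from the text; None marks steps the text
    does not number.
    """
    return {
        "F1": (None, "prompt for a desired synthesis process"),
        "F2": (510, "edit a synthesis process file"),
        "F3": (512, "load a process from disk"),
        "F4": (514, "save an entered or edited process to disk"),
        "F5": (516, "display the current process"),
        "F6": (518, "run the synthesis"),
        "F7": (None, "display the location of the synthesized peptides"),
        "F10": (None, "return to the disk operating system"),
    }


def main_loop(keys):
    """Process a sequence of key presses and return the actions taken."""
    menu = make_menu()
    log = []
    for key in keys:
        if key not in menu:
            continue  # step 506: unrecognized key, keep polling
        step, action = menu[key]
        log.append((key, step, action))
        if key == "F10":
            break  # leave the main loop
    return log


for entry in main_loop(["F3", "F6", "F10"]):
    print(entry)
```

A table-driven dispatch keeps the key assignments in one place,
which is convenient if further keys are added.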

FIG. 5b illustrates the synthesis step 518 in greater detail.
The main loop of the program is started in which the system
first moves the mask to a next position at step 526. During
the main loop of the program, necessary chemicals flow
through the reaction cell under the direction of the on-board
computer in the peptide synthesizer. At step 528 the system
then waits for an exposure command and, upon receipt of the
exposure command, exposes the substrate for a desired time
at step 530. When an acknowledge of exposure complete is
received at step 532 the system determines if the process is
complete at step 534 and, if so, waits for additional keyboard
input at step 536 and, thereafter, exits the perform synthesis
process.



A computer program used for operation of the system
described above is included as microfiche Appendix A
(Copyright, 1990, Affymax Technologies N.V., all rights
reserved). The program is written in Turbo C++ (Borland
Int'l) and has been implemented in an IBM compatible
system. The motor control software is adapted from software
produced by Newport Corporation. It will be recognized that
a large variety of programming languages could be utilized
without departing from the scope of the invention herein.
Certain calls are made to a graphics program in "Programmer
Guide to PC and PS2 Video Systems" (Wilton,
Microsoft Press, 1987), which is incorporated herein by
reference for all purposes.

Alignment of the mask is achieved by one of two methods
in preferred embodiments. In a first embodiment the system
relies upon relative alignment of the various components,
which is normally acceptable since x-y-z translation stages
are capable of sufficient accuracy for the purposes herein. In
alternative embodiments, alignment marks on the substrate
are coupled to a CCD device for appropriate alignment.

According to some embodiments, pure reagents are not
added at each step, or complete photolysis of the protective
groups is not provided at each step. According to these
embodiments, multiple products will be formed in each
synthesis site. For example, if the monomers A and B are
mixed during a synthesis step, A and B will bind to
deprotected regions, roughly in proportion to their
concentration in solution. Hence, a mixture of compounds will
be formed in a synthesis region. A substrate formed with
mixtures of compounds in various synthesis regions may be
used to perform, for example, an initial screening of a large
number of compounds, after which a smaller number of
compounds in regions which exhibit high binding affinity are
further screened. Similar results may be obtained by only
partially photolyzing a region, adding a first monomer,
re-photolyzing the same region, and exposing the region to a
second monomer.

B. Binary Synthesis Strategy 

In a light-directed chemical synthesis, the products 
formed depend on the pattern and order of masks, and on the 
order of reactants. To make a set of products there will in 
general be "n" possible masking schemes. In preferred 
embodiments of the invention herein a binary synthesis
strategy is utilized. The binary synthesis strategy is
illustrated herein primarily with regard to a masking strategy,
although it will be applicable to other polymer synthesis
strategies such as the pin strategy, and the like.

In a binary synthesis strategy, the substrate is irradiated 
with a first mask, exposed to a first building block, irradiated 
with a second mask, exposed to a second building block, etc.
Each combination of masked irradiation and exposure to a 
building block is referred to herein as a "cycle." 

In a preferred binary masking scheme, the masks for each
cycle allow irradiation of half of a region of interest on the
substrate and protection of the remaining half of the region
of interest. By "half" it is intended herein not to mean
exactly one-half the region of interest, but instead a large
fraction of the region of interest such as from about 30 to 70
percent of the region of interest. It will be understood that
the entire masking scheme need not take a binary form;
instead non-binary cycles may be introduced as desired
between binary cycles.

In preferred embodiments of the binary masking scheme,
a given cycle illuminates only about half of the region which
was illuminated in a previous cycle, while protecting the
remaining half of the illuminated portion from the previous
cycle. Conversely, in such preferred embodiments, a given
cycle illuminates half of the region which was protected in
the previous cycle and protects half the region which was
protected in a previous cycle.

The synthesis strategy is most readily illustrated and
handled in matrix notation. At each synthesis site, the
determination of whether to add a given monomer is a binary
process. Therefore, each product element P_j is given by the
dot product of two vectors, a chemical reactant vector, e.g.,
C=[A,B,C,D], and a binary vector \sigma_j. Inspection of the
products in the example below for a four-step synthesis
shows that in one four-step synthesis \sigma_1=[1,0,1,0],
\sigma_2=[1,0,0,1], \sigma_3=[0,1,1,0], and \sigma_4=[0,1,0,1],
where a 1 indicates illumination and a 0 indicates protection.
Therefore, it becomes possible to build a "switch matrix" S
from the column vectors \sigma_j (j = 1, ..., k, where k is the
number of products).

\[
S =
\begin{bmatrix}
\sigma_1 & \sigma_2 & \sigma_3 & \sigma_4
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1
\end{bmatrix}
\]

The outcome P of a synthesis is simply P=CS, the product
of the chemical reactant matrix and the switch matrix.
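
The product P=CS can be checked with a few lines of Python.
The sketch below is illustrative only; the function name
synthesize is hypothetical, and the "dot product" is taken to
mean ordered concatenation of the reactants selected by the 1
entries of each column, which is how the text uses it.

```python
def synthesize(reactants, switch_matrix):
    """Return one product string per column of the switch matrix.

    reactants      -- ordered list of monomer symbols (the vector C)
    switch_matrix  -- list of rows; row i is the mask for reactant i
    """
    n_steps = len(reactants)
    n_products = len(switch_matrix[0])
    products = []
    for j in range(n_products):
        # A 1 in row i, column j means reactant i is coupled at site j.
        product = "".join(
            reactants[i] for i in range(n_steps) if switch_matrix[i][j] == 1
        )
        products.append(product)
    return products


C = ["A", "B", "C", "D"]
S = [
    [1, 1, 0, 0],  # mask for A
    [0, 0, 1, 1],  # mask for B
    [1, 0, 1, 0],  # mask for C
    [0, 1, 0, 1],  # mask for D
]
print(synthesize(C, S))  # ['AC', 'AD', 'BC', 'BD']
```

Each column of S is one switch vector, so the four columns
give the four products of the four-step synthesis.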

The switch matrix for an n-cycle synthesis yielding k
products has n rows and k columns. An important attribute
of S is that each row specifies a mask. A two-dimensional
mask m_j for the jth chemical step of a synthesis is obtained
directly from the jth row of S by placing the elements
s_{j1}, ..., s_{jk} into, for example, a square format. The
particular arrangement below provides a square format although
linear or other arrangements may be utilized.

\[
S =
\begin{bmatrix}
s_{11} & s_{12} & s_{13} & s_{14} \\
s_{21} & s_{22} & s_{23} & s_{24} \\
s_{31} & s_{32} & s_{33} & s_{34} \\
s_{41} & s_{42} & s_{43} & s_{44}
\end{bmatrix}
\qquad
m_j =
\begin{bmatrix}
s_{j1} & s_{j2} \\
s_{j3} & s_{j4}
\end{bmatrix}
\]



Of course, compounds formed in a light-activated syn- 
thesis can be positioned in any defined geometric array. A 
square or rectangular matrix is convenient but not required. 
The rows of the switch matrix may be transformed into any 
convenient array as long as equivalent transformations are 
used for each row. 

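For example, the masks in the four-step synthesis below
are then denoted by:

\[
m_1 = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}
\quad
m_2 = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix}
\quad
m_3 = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}
\quad
m_4 = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}
\]

where 1 denotes illumination (activation) and 0 denotes no
illumination.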
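
As a minimal sketch, assuming the row-major square arrangement
described above, the four masks can be produced from the rows
of the four-step switch matrix as follows. The helper name
row_to_mask is hypothetical.

```python
import math


def row_to_mask(row):
    """Reshape one row of the switch matrix into a square mask (row-major)."""
    side = math.isqrt(len(row))
    if side * side != len(row):
        raise ValueError("row length must be a perfect square")
    return [row[r * side:(r + 1) * side] for r in range(side)]


# Switch matrix of the four-step synthesis: one row per chemical step.
S = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
masks = [row_to_mask(row) for row in S]
for j, mask in enumerate(masks, start=1):
    print(f"m{j} =", mask)
```

Any other arrangement may be substituted for the reshape, so
long as the same transformation is applied to every row.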

The matrix representation is used to generate a desired set 
of products and product maps in preferred embodiments.
Each compound is defined by the product of the chemical 
vector and a particular switch vector. Therefore, for each 
synthesis address, one simply saves the switch vector, 
assembles all of them into a switch matrix, and extracts each 
of the rows to form the masks. 

In some cases, particular product distributions or a
maximal number of products are desired. For example, for
C=[A,B,C,D], any switch vector (\sigma_j) consists of four bits.
Sixteen four-bit vectors exist. Hence, a maximum of 16
different products can be made by sequential addition of the
reagents [A,B,C,D]. These 16 column vectors can be
assembled in 16! different ways to form a switch matrix. The
order of the column vectors defines the masking patterns,
and, therefore, the spatial ordering of products but not their
makeup. One ordering of these columns gives the following
switch matrix (in which "null" (∅) additions are included in
brackets for the sake of completeness, although such null
additions are elsewhere ignored herein):

\[
S =
\begin{array}{*{16}{c}l}
1&1&1&1&1&1&1&1&0&0&0&0&0&0&0&0 & A \\
{[}0&0&0&0&0&0&0&0&1&1&1&1&1&1&1&1{]} & \emptyset \\
1&1&1&1&0&0&0&0&1&1&1&1&0&0&0&0 & B \\
{[}0&0&0&0&1&1&1&1&0&0&0&0&1&1&1&1{]} & \emptyset \\
1&1&0&0&1&1&0&0&1&1&0&0&1&1&0&0 & C \\
{[}0&0&1&1&0&0&1&1&0&0&1&1&0&0&1&1{]} & \emptyset \\
1&0&1&0&1&0&1&0&1&0&1&0&1&0&1&0 & D \\
{[}0&1&0&1&0&1&0&1&0&1&0&1&0&1&0&1{]} & \emptyset
\end{array}
\]

The columns of S according to this aspect of the invention
are the binary representations of the numbers 15 to 0. The
sixteen products of this binary synthesis are ABCD, ABC,
ABD, AB, ACD, AC, AD, A, BCD, BC, BD, B, CD, C, D,
and ∅ (null). Also note that each of the switch vectors from
the four-step synthesis masks above (and hence the synthesis
products) are present in the four bit binary switch matrix.
(See columns 6, 7, 10, and 11.)
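
The sixteen-column matrix and its products can be regenerated
mechanically. The following Python sketch is illustrative
only; binary_switch_matrix and products are hypothetical
names, and the bracketed null rows are omitted because they
are simply the complements of the reactant rows.

```python
def binary_switch_matrix(n):
    """Return n rows whose columns are the n-bit numbers 2**n - 1 down to 0."""
    columns = range(2**n - 1, -1, -1)
    # Row i holds bit i of every column, most significant bit first.
    return [[(c >> (n - 1 - i)) & 1 for c in columns] for i in range(n)]


def products(reactants, S):
    """Name the product at each column; the empty product is the null."""
    names = []
    for j in range(len(S[0])):
        name = "".join(r for r, row in zip(reactants, S) if row[j])
        names.append(name or "∅")
    return names


S16 = binary_switch_matrix(4)
P16 = products("ABCD", S16)
print(P16)

# Columns 6, 7, 10 and 11 (counting from 1) hold the four-step products.
print([P16[k - 1] for k in (6, 7, 10, 11)])  # ['AC', 'AD', 'BC', 'BD']
```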

This synthesis procedure provides an easy way for mapping
the completed products. The products in the various
locations on the substrate are simply defined by the columns
of the switch matrix (the first column indicating, for
example, that the product ABCD will be present in the upper
left-hand location of the substrate). Furthermore, if only
selected desired products are to be made, the mask sequence
can be derived by extracting the columns with the desired
sequences. For example, to form the product set ABCD,
ABD, ACD, AD, BCD, BD, CD, and D, the masks are
formed by use of a switch matrix with only the 1st, 3rd, 5th,
7th, 9th, 11th, 13th, and 15th columns arranged into the
switch matrix:

\[
S =
\begin{bmatrix}
1&1&1&1&0&0&0&0 \\
1&1&0&0&1&1&0&0 \\
1&0&1&0&1&0&1&0 \\
1&1&1&1&1&1&1&1
\end{bmatrix}
\]
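
The column extraction just described is easy to verify. In
the sketch below, which is an illustration and not part of
the original program, the four reactant rows of the
sixteen-column binary switch matrix are written as bit
strings and select_columns is a hypothetical helper.

```python
# Reactant rows (A, B, C, D) of the sixteen-column binary switch matrix.
FULL = [
    "1111111100000000",
    "1111000011110000",
    "1100110011001100",
    "1010101010101010",
]


def select_columns(rows, wanted):
    """Keep only the columns listed in `wanted` (numbered from 1)."""
    return ["".join(row[k - 1] for k in wanted) for row in rows]


odd_columns = range(1, 16, 2)  # 1st, 3rd, ..., 15th
reduced = select_columns(FULL, odd_columns)
for row in reduced:
    print(row)

# Read each remaining column top to bottom to name its product.
names = ["".join(r for r, row in zip("ABCD", reduced) if row[j] == "1")
         for j in range(len(reduced[0]))]
print(names)
```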

To form all of the polymers of length 4, the reactant matrix
[ABCDABCDABCDABCD] is used. The switch matrix will
be formed from a matrix of the binary numbers from 0 to 2^16
arranged in columns. The columns having four monomers
are then selected and arranged into a switch matrix.
Therefore, it is seen that the binary switch matrix in general
will provide a representation of all the products which can
be made from an n-step synthesis, from which the desired
products are then extracted.

The rows of the binary switch matrix will, in preferred
embodiments, have the property that each masking step
illuminates half of the synthesis area. Each masking step
also factors the preceding masking step; that is, half of the
region that was illuminated in the preceding step is again
illuminated, whereas the other half is not. Half of the region
that was unilluminated in the preceding step is also
illuminated, whereas the other half is not. Thus, masking is
recursive. The masks are constructed, as described
previously, by extracting the elements of each row and
placing them in a square array. For example, the four masks
in S for a four-step synthesis are:

\[
m_1 = \begin{bmatrix} 1&1&1&1 \\ 1&1&1&1 \\ 0&0&0&0 \\ 0&0&0&0 \end{bmatrix}
\quad
m_2 = \begin{bmatrix} 1&1&1&1 \\ 0&0&0&0 \\ 1&1&1&1 \\ 0&0&0&0 \end{bmatrix}
\quad
m_3 = \begin{bmatrix} 1&1&0&0 \\ 1&1&0&0 \\ 1&1&0&0 \\ 1&1&0&0 \end{bmatrix}
\quad
m_4 = \begin{bmatrix} 1&0&1&0 \\ 1&0&1&0 \\ 1&0&1&0 \\ 1&0&1&0 \end{bmatrix}
\]



The recursive factoring of masks allows the products of a
light-directed synthesis to be represented by a polynomial.
(Some light activated syntheses can only be denoted by
irreducible, i.e., prime polynomials.) For example, the
polynomial corresponding to the top synthesis of FIG. 9a
(discussed below) is

\[
P = (A+B)(C+D)
\]

A reaction polynomial may be expanded as though it were
an algebraic expression, provided that the order of joining of
reactants X1 and X2 is preserved (X1X2 ≠ X2X1), i.e., the
products are not commutative. The product then is AC+AD+
BC+BD. The polynomial explicitly specifies the reactants
and implicitly specifies the mask for each step. Each pair of
parentheses demarcates a round of synthesis. The chemical
reactants of a round (e.g., A and B) react at nonoverlapping
sites and hence cannot combine with one other. The synthesis
area is divided equally amongst the elements of a round
(e.g., A is directed to one-half of the area and B to the other
half). Hence, the masks for a round (e.g., the masks m_A and
m_B) are orthogonal and form an orthonormal set. The
polynomial notation also signifies that each element in a
round is to be joined to each element of the next round (e.g.,
A with C, A with D, B with C, and B with D). This is
accomplished by having m_C overlap m_A and m_B equally, and
likewise for m_D. Because C and D are elements of a round,
m_C and m_D are orthogonal to each other and form an
orthonormal set.

The polynomial representation of the binary synthesis
described above, in which 16 products are made from 4
reactants, is

\[
P = (A+\emptyset)(B+\emptyset)(C+\emptyset)(D+\emptyset)
\]

which gives ABCD, ABC, ABD, AB, ACD, AC, AD, A,
BCD, BC, BD, B, CD, C, D, and ∅ when expanded (with the
rule that ∅X=X and X∅=X, and remembering that joining is
ordered). In a binary synthesis, each round contains one
reactant and one null (denoted by ∅). Half of the synthesis
area receives the reactant and the other half receives nothing.
Each mask overlaps every other mask equally.
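
Expansion of a reaction polynomial can be sketched in Python
with an ordered Cartesian product. The sketch is illustrative
only: expand is a hypothetical name, each round is written as
a list of reactant symbols, and the null is represented by the
empty string so that joining it to X leaves X unchanged.

```python
from itertools import product

NULL = ""  # null addition: NULL + X == X and X + NULL == X


def expand(rounds):
    """Expand a reaction polynomial given as a list of rounds.

    Joining is ordered, so each product is built left to right from
    one element of every round.
    """
    return ["".join(choice) for choice in product(*rounds)]


# (A+B)(C+D)
print(expand([["A", "B"], ["C", "D"]]))  # ['AC', 'AD', 'BC', 'BD']

# Binary synthesis: one reactant and one null in every round.
binary = expand([[x, NULL] for x in "ABCD"])
print(len(binary), binary)
```

Because the first round varies slowest, the sixteen binary
products come out in the same order as listed in the text,
with the empty string standing for the null product.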

Binary rounds and non-binary rounds can be interspersed
as desired, as in

\[
P = (A+\emptyset)(B)(C+D+\emptyset)(E+F+G)
\]

The 18 compounds formed are ABCE, ABCF, ABCG,
ABDE, ABDF, ABDG, ABE, ABF, ABG, BCE, BCF, BCG,
BDE, BDF, BDG, BE, BF, and BG. The switch matrix S for
this 7-step synthesis is

\[
S =
\begin{array}{*{18}{c}}
1&1&1&1&1&1&1&1&1&0&0&0&0&0&0&0&0&0 \\
1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1&1 \\
1&1&1&0&0&0&0&0&0&1&1&1&0&0&0&0&0&0 \\
0&0&0&1&1&1&0&0&0&0&0&0&1&1&1&0&0&0 \\
1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0 \\
0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0 \\
0&0&1&0&0&1&0&0&1&0&0&1&0&0&1&0&0&1
\end{array}
\]

The round denoted by (B) places B in all products because
the reaction area was uniformly activated (the mask for B
consisted entirely of 1's).

The number of compounds k formed in a synthesis
consisting of r rounds, in which the ith round has b_i chemical
reactants and z_i nulls, is

\[
k = \prod_{i=1}^{r} (b_i + z_i)
\]

and the number of chemical steps n is

\[
n = \sum_{i=1}^{r} b_i
\]

The number of compounds synthesized when b=a and z=0 in
all rounds is a^{n/a}, compared with 2^n for a binary synthesis.
For n=20 and a=5, 625 compounds (all tetramers) would be
formed, compared with 1.049x10^6 compounds in a binary
synthesis with the same number of chemical steps.
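
The counts quoted above can be confirmed numerically. In this
minimal Python sketch each round is given as a pair (b, z) of
reactant and null counts; the function names are hypothetical,
and the round structure used for the 18-compound example is an
assumption read off its 7-step switch matrix.

```python
from math import prod


def count_compounds(rounds):
    """k: product over rounds of (reactants + nulls)."""
    return prod(b + z for b, z in rounds)


def count_steps(rounds):
    """n: total number of chemical reactants, one coupling step each."""
    return sum(b for b, _ in rounds)


# 18-compound, 7-step example (round structure assumed as noted above).
mixed = [(1, 1), (1, 0), (2, 1), (3, 0)]
print(count_compounds(mixed), count_steps(mixed))  # 18 7

# n = 20 steps: four rounds of five reactants versus twenty binary rounds.
tetramers = [(5, 0)] * 4
binary = [(1, 1)] * 20
print(count_compounds(tetramers), count_steps(tetramers))  # 625 20
print(count_compounds(binary), count_steps(binary))        # 1048576 20
```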

It should also be noted that rounds in a polynomial can be
nested, as in

\[
P = (A + (B+\emptyset)(C+\emptyset))(D+\emptyset)
\]

The products are AD, BCD, BD, CD, D, A, BC, B, C, and
∅.

Binary syntheses are attractive for two reasons. First, they
generate the maximal number of products (2^n) for a given
number of chemical steps (n). For four reactants, 16
compounds are formed in the binary synthesis, whereas only 4
are made when each round has two reactants. A 10-step
binary synthesis yields 1,024 compounds, and a 20-step
synthesis yields 1,048,576. Second, products formed in a
binary synthesis are a complete nested set with lengths
ranging from 0 to n. All compounds that can be formed by
deleting one or more units from the longest product (the
n-mer) are present. Contained within the binary set are the
smaller sets that would be formed from the same reactants
using any other set of masks (e.g., AC, AD, BC, and BD
formed in the synthesis shown in FIG. 6 are present in the
set of 16 formed by the binary synthesis). In some cases,
however, the experimentally achievable spatial resolution
may not suffice to accommodate all the compounds formed.
Therefore, practical limitations may require one to select a
particular subset of the possible switch vectors for a given
synthesis.

1. EXAMPLE 

FIG. 6 illustrates a synthesis with a binary masking scheme.
The binary masking scheme provides the greatest number of
sequences for a given number of cycles. According to this
embodiment, a mask m1 allows illumination of half of the
substrate. The substrate is then exposed to the building block
A, which binds at the illuminated regions.

Thereafter, the mask m2 allows illumination of half of the
previously illuminated region, while protecting half of the
previously illuminated region. The building block B is then
added, which binds at the illuminated regions from m2.

The process continues with masks m3, m4, and m5,
resulting in the product array shown in the bottom portion of
the figure. The process generates 32 (2 raised to the power
of the number of monomers) sequences with 5 (the number
of monomers) cycles.

2. EXAMPLE 

FIG. 7 illustrates another preferred binary masking
scheme which is referred to herein as the gray code masking
scheme. According to this embodiment, the masks m1 to m5
are selected such that a side of any given synthesis region is
defined by the edge of only one mask. The site at which the
sequence BCDE is formed, for example, has its right edge
defined by m5 and its left side formed by mask m4 (and no
other mask is aligned on the sides of this site). Accordingly,
problems created by misalignment, diffusion of light under
the mask and the like will be minimized.
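
The single-edge property can be illustrated numerically. The
Python sketch below is not the layout of FIG. 7, which is not
reproduced here; it assumes the sites are arranged in a line
and uses the standard reflected Gray code for the columns of
the switch matrix. All function names are hypothetical.

```python
def gray_switch_matrix(n):
    """Switch matrix whose columns are the n-bit reflected Gray codes."""
    codes = [i ^ (i >> 1) for i in range(2**n)]
    return [[(c >> (n - 1 - i)) & 1 for c in codes] for i in range(n)]


def plain_binary_matrix(n):
    """Switch matrix whose columns simply count from 0 to 2**n - 1."""
    return [[(c >> (n - 1 - i)) & 1 for c in range(2**n)] for i in range(n)]


def edges_per_boundary(S):
    """For each boundary between neighbouring sites, count the masks
    whose illuminated/dark edge falls on that boundary."""
    k = len(S[0])
    return [sum(row[j] != row[j + 1] for row in S) for j in range(k - 1)]


G = gray_switch_matrix(5)
B = plain_binary_matrix(5)
print(set(edges_per_boundary(G)))  # {1}: one mask edge per boundary
print(max(edges_per_boundary(B)))  # 5: all five masks share one boundary
```

With plain binary ordering, the boundary at the middle of the
array coincides with an edge of every mask, so any
misalignment there affects all five steps; with Gray code
ordering each boundary depends on a single mask.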

3. EXAMPLE

FIG. 8 illustrates another binary masking scheme.
According to this scheme, referred to herein as a modified
gray code masking scheme, the number of masks needed is
minimized. For example, the mask m2 could be the same
mask as m1 and simply translated laterally. Similarly, the
mask m4 could be the same as mask m3 and simply
translated laterally.

4. EXAMPLE

A four-step synthesis is shown in FIG. 9a. The reactants
are the ordered set {A,B,C,D}. In the first cycle, illumination
through m1 activates the upper half of the synthesis area.
Building block A is then added to give the distribution 602.
Illumination through mask m2 (which activates the lower
half), followed by addition of B yields the next intermediate
distribution 604. C is added after illumination through m3
(which activates the left half) giving the distribution 604,
and D after illumination through m4 (which activates the
right half), to yield the final product pattern 608 (AC, AD,
BC, BD).

5. EXAMPLE

The above masking strategy for the synthesis may be
extended for all 400 dipeptides from the 20 naturally
occurring amino acids as shown in FIG. 9b. The synthesis
consists of two rounds, with 20 photolysis and chemical
coupling cycles per round. In the first cycle of round 1, mask 1
activates 1/20th of the substrate for coupling with the first of
20 amino acids. Nineteen subsequent illumination/coupling
cycles in round 1 yield a substrate consisting of 20
rectangular stripes each bearing a distinct member of the 20
amino acids. The masks of round 2 are perpendicular to round 1
masks and therefore a single illumination/coupling cycle in
round 2 yields 20 dipeptides. The 20 illumination/coupling
cycles of round 2 complete the synthesis of the 400
dipeptides.

6. EXAMPLE

The power of the binary masking strategy can be
appreciated by the outcome of a 10-step synthesis that produced
1,024 peptides. The polynomial expression for this 10-step
binary synthesis was:

\[
(f+\emptyset)(Y+\emptyset)(G+\emptyset)(A+\emptyset)(G+\emptyset)(T+\emptyset)(F+\emptyset)(L+\emptyset)(S+\emptyset)(F+\emptyset)
\]

Each peptide occupied a 400x400 µm square. A 32x32
peptide array (1,024 peptides, including the null peptide and
10 peptides of l=1, and a limited number of duplicates) was
clearly evident in a fluorescence scan following side group
deprotection and treatment with the antibody 3E7 and
fluoresceinated antibody. Each synthesis site was a 400x400 µm
square.

The scan showed a range of fluorescence intensities, from
a background value of 3,300 counts to 22,400 counts in the
brightest square (x=20, y=9). Only 15 compounds exhibited
an intensity greater than 12,300 counts. The median value of
the array was 4,800 counts.

The identity of each peptide in the array could be
determined from its x and y coordinates (each range from 0 to
31) and the map of FIG. 10. The chemical units at positions 2,
5, 6, 9, and 10 are specified by the y coordinate and those at
positions 1, 3, 4, 7, 8 by the x coordinate. All but one of the
peptides was shorter than 10 residues. For example, the
peptide at x=12 and y=3 is YGAGF (SEQ. ID No:3)
(positions 1, 6, 8, 9, and 10 are nulls). YGAFLS (SEQ. ID
No:4), the brightest element of the array, is at x=20 and y=9.

It is often desirable to deduce a binding affinity of a given
peptide from the measured fluorescence intensity.
Conceptually, the simplest case is one in which a single
peptide binds to a univalent antibody molecule. The
fluorescence scan is carried out after the slide is washed with
buffer for a defined time. The order of fluorescence
intensities is then a measure primarily of the relative
dissociation rates of the antibody-peptide complexes. If the
on-rate constants are the same (e.g., if they are
diffusion-controlled), the order of fluorescence intensities will
correspond to the order of binding affinities. However, the
situation is sometimes more complex because a bivalent
primary antibody and a bivalent secondary antibody are used.
The density of peptides in a synthesis area corresponded to a
mean separation of ~7 nm, which would allow multivalent
antibody-peptide interactions. Hence, fluorescence intensities
obtained according to the method herein will often be a
qualitative indicator of binding affinity.

Another important consideration is the fidelity of
synthesis. Deletions are produced by incomplete
photodeprotection or incomplete coupling. The coupling yield
per cycle in these experiments is typically between 85% and
95%. Implementing the switch matrix by masking is imperfect
because of light diffraction, internal reflection, and
scattering. Consequently, stowaways (chemical units that should
not be on board) arise by unintended illumination of regions
that should be dark. A binary synthesis array contains many
of the controls needed to assess the fidelity of a synthesis.
For example, the fluorescence signal from a synthesis area
nominally containing a tetrapeptide ABCD could come from
a tripeptide deletion impurity such as ACD. Such an artifact
would be ruled out by the finding that the fluorescence
intensity of the ACD site is less than that of the ABCD site.

The fifteen most highly labelled peptides in the array
obtained with the synthesis of 1,024 peptides described
above, were YGAFLS (SEQ. ID No:5), YGAFS (SEQ. ID
No:6), YGAFL (SEQ. ID No:7), YGGFLS (SEQ. ID No:
[illegible]), YGAF (SEQ. ID No:8), YGALS (SEQ. ID No:9),
YGGFS (SEQ. ID No:10), YGAL (SEQ. ID No:11), YGAFLF
(SEQ. ID No:12), YGAF (SEQ. ID No:13), YGAFF (SEQ. ID
No:14), YGGLS (SEQ. ID No:15), YGGFL (SEQ. ID
No:16), [illegible] (SEQ. ID No:17), and YGAFLSF (SEQ. ID
No: [illegible]). All fifteen begin with YG, which agrees with
previous work showing that an amino-terminal tyrosine is a
key determinant of binding. Residue 3 of this set is either A
or G, and residue 4 is either F or L. The exclusion of S and T
from these positions is clear cut. The finding that the
preferred sequence is YG (A/G) (F/L) fits nicely with the
outcome of a study in which a very large library of peptides on
phage generated by recombinant DNA methods was screened
for binding to antibody 3E7 (see Cwirla et al., Proc. Natl.
Acad. Sci. USA, (1990) 87:6378, incorporated herein by
reference). Additional binary syntheses based on leads from
peptides on phage experiments show that YGAFMQ (SEQ.
ID No:18), YGAFM (SEQ. ID No:19), and YGAFQ (SEQ. ID
No: [illegible]) give stronger fluorescence signals than does
YGGFM, the immunogen used to obtain antibody 3E7.

Variations on the above masking strategy will be valuable
in certain circumstances. For example, if a "kernel"
sequence of interest consists of PQR separated from XYZ
and that the aim is to synthesize peptides in which these
units are separated by a variable number of different
residues, then the kernel can be placed in each peptide by
using a mask that has 1's everywhere. The polynomial
representation of a suitable synthesis is:

\[
(P)(Q)(R)(A+\emptyset)(B+\emptyset)(C+\emptyset)(D+\emptyset)(X)(Y)(Z)
\]

Sixteen peptides will be formed, ranging in length from the
6-mer PQRXYZ to the 10-mer PQRABCDXYZ.

Several other masking strategies will also find value in
selected circumstances. By using a particular mask more
than once, two or more reactants will appear in the same set
of products. For example, suppose that the mask for an
8-step synthesis is

A  1 1 1 1 0 0 0 0
B  0 0 0 0 1 1 1 1
C  1 1 0 0 1 1 0 0
D  0 0 1 1 0 0 1 1
E  1 0 1 0 1 0 1 0
F  0 1 0 1 0 1 0 1
G  1 1 1 1 0 0 0 0
H  0 0 0 0 1 1 1 1

The products are ACEG, ACFG, ADEG, ADFG, BCEH,
BCFH, BDEH, and BDFH. A and G always appear together
because their additions were directed by the same mask, and
likewise for B and H.

C. Linker Selection

According to preferred embodiments the linker molecules
used as an intermediary between the synthesized polymers
and the substrate are selected for optimum length and/or
type for improved binding interaction with a receptor.
According to this aspect of the invention diverse linkers of
varying length and/or type are synthesized for subsequent
attachment of a ligand. Through variations in the length and
type of linker, it becomes possible to optimize the binding
interaction between an immobilized ligand and its receptor.

The degree of binding between a ligand (peptide,
inhibitor, hapten, drug, etc.) and its receptor (enzyme,
antibody, etc.) when one of the partners is immobilized on
to a substrate will in some embodiments depend on the
accessibility of the receptor in solution to the immobilized
ligand. The accessibility in turn will depend on the length
and/or type of linker molecule employed to immobilize one
of the partners. Preferred embodiments of the invention
therefore employ the VLSIPS™ technology described
herein to generate an array of, preferably, inactive or inert
linkers of varying length and/or type, using photochemical
protecting groups to selectively expose different regions of
the substrate and to build upon chemically-active groups.

In the simplest embodiment of this concept, the same unit
is attached to the substrate in varying multiples or lengths in
known locations on the substrate via VLSIPS™ techniques
to generate an array of polymers of varying length. A single
ligand (peptide, drug, hapten, etc.) is attached to each of
them and an assay is performed with the binding site to
evaluate the degree of binding with a receptor that is known
to bind to the ligand. In cases where the linker length
impacts the ability of the receptor to bind to the ligand,
varying levels of binding will be observed. In general, the
linker which provides the highest binding will then be used
to assay other ligands synthesized in accordance with the
techniques herein.

According to other embodiments the binding between a
single ligand/receptor pair is evaluated for linkers of diverse
monomer sequence. According to these embodiments, the
linkers are synthesized in an array in accordance with the
techniques herein and have different monomer sequence
(and, optionally, different lengths). Thereafter, all of the
linker molecules are provided with a ligand known to have
at least some binding affinity for a given receptor. The given
receptor is then exposed to the ligand and binding affinity is
deduced. Linker molecules which provide adequate binding
between the ligand and receptor are then utilized in
screening studies.
D. Protecting Groups 

As discussed above, selectively removable protecting
groups allow creation of well defined areas of substrate
surface having differing reactivities. Preferably, the
protecting groups are selectively removed from the surface by
applying a specific activator, such as electromagnetic
radiation of a specific wavelength and intensity. More
preferably, the specific activator exposes selected areas of
surface to remove the protecting groups in the exposed areas.

Protecting groups of the present invention are used in
conjunction with solid phase oligomer syntheses, such as
peptide syntheses using natural or unnatural amino acids,
nucleotide syntheses using deoxyribonucleic and
ribonucleic acids, oligosaccharide syntheses, and the like. In
addition to protecting the substrate surface from unwanted
reaction, the protecting groups block a reactive end of the
monomer to prevent self-polymerization. For instance,
attachment of a protecting group to the amino terminus of an
activated amino acid, such as an N-hydroxysuccinimide-
activated ester of the amino acid, prevents the amino
terminus of one monomer from reacting with the activated ester
portion of another during peptide synthesis. Alternatively,
the protecting group may be attached to the carboxyl group
of an amino acid to prevent reaction at this site. Most
protecting groups can be attached to either the amino or the
carboxyl group of an amino acid, and the nature of the
chemical synthesis will dictate which reactive group will
require a protecting group. Analogously, attachment of a
protecting group to the 5'-hydroxyl group of a nucleoside
during synthesis using, for example, phosphate-triester
coupling chemistry, prevents the 5'-hydroxyl of one nucleoside
from reacting with the 3'-activated phosphate-triester of
another.

Regardless of the specific use, protecting groups are
employed to protect a moiety on a molecule from reacting
with another reagent. Protecting groups of the present
invention have the following characteristics: they prevent
selected reagents from modifying the group to which they are
attached; they are stable (that is, they remain attached to the
molecule) to the synthesis reaction conditions; they are
removable under conditions that do not adversely affect the
remaining structure; and once removed, do not react
appreciably with the surface or surface-bound oligomer. The
selection of a suitable protecting group will depend, of
course, on the chemical nature of the monomer unit and
oligomer, as well as the specific reagents they are to protect
against.

In a preferred embodiment, the protecting groups are
photoactivatable. The properties and uses of photoreactive
protecting compounds have been reviewed. See, McCray et
al., Ann. Rev. of Biophys. and Biophys. Chem. (1989)
18:239-270, which is incorporated herein by reference.
Preferably, the photosensitive protecting groups will be
removable by radiation in the ultraviolet (UV) or visible
portion of the electromagnetic spectrum. More preferably,
the protecting groups will be removable by radiation in the
near UV or visible portion of the spectrum. In some
embodiments, however, activation may be performed by
other methods such as localized heating, electron beam
lithography, laser pumping, oxidation or reduction with
microelectrodes, and the like. Sulfonyl compounds are
suitable reactive groups for electron beam lithography.
Oxidative or reductive removal is accomplished by exposure
of the protecting group to an electric current source,
preferably using microelectrodes directed to the predefined
regions of the surface which are desired for activation. Other
methods may be used in light of this disclosure.

Many, although not all, of the photoremovable protecting
groups will be aromatic compounds that absorb near-UV and
visible radiation. Suitable photoremovable protecting
groups are described in, for example, McCray et al.,
Patchornik, J. Amer. Chem. Soc. (1970) 92:6333, and Amit
et al., J. Org. Chem. (1974) 39:192, which are incorporated
herein by reference.

A preferred class of photoremovable protecting groups
has the general formula:

[Structural formula: general nitrobenzyl-type protecting group]

where R1, R2, R3, and R4 independently are a hydrogen
atom, a lower alkyl, aryl, benzyl, halogen, hydroxyl,
alkoxyl, thiol, thioether, amino, nitro, carboxyl, formate,
formamido or phosphido group, or adjacent substituents
(i.e., R1-R2, R2-R3, R3-R4) are substituted oxygen groups
that together form a cyclic acetal or ketal; R5 is a hydrogen
atom, an alkoxyl, alkyl, hydrogen, halo, aryl, or alkenyl
group, and n=0 or 1.

A preferred protecting group, 6-nitroveratryl (NV), which
is used for protecting the carboxyl terminus of an amino acid
or the hydroxyl group of a nucleotide, for example, is
formed when R2 and R3 are each a methoxy group, R1, R4
and R5 are each a hydrogen atom, and n=0:

[Structural formula: NV]

A preferred protecting group, 6-nitroveratryloxycarbonyl
(NVOC), which is used to protect the amino terminus of an
amino acid, for example, is formed when R2 and R3 are each
a methoxy group, R1, R4 and R5 are each a hydrogen atom,
and n=1:

[Structural formula: NVOC]


Another preferred protecting group,
6-nitropiperonyloxycarbonyl (NPOC), which is used to protect
the amino terminus of an amino acid, for example, is
formed when R2 and R3 together form a methylene acetal,
R1, R4 and R5 are each a hydrogen atom, and n=1:




Another preferred protecting group, 6-nitropiperonyl
(NP), which is used for protecting the carboxyl terminus of
an amino acid or the hydroxyl group of a nucleotide, for
example, is formed when R2 and R3 together form a
methylene acetal, R1, R4 and R5 are each a hydrogen atom, and
n=0:

[Structural formula: NP]

A most preferred protecting group, methyl-6-nitroveratryl
(MeNV), which is used for protecting the carboxyl terminus
of an amino acid or the hydroxyl group of a nucleotide, for
example, is formed when R2 and R3 are each a methoxy
group, R1 and R4 are each a hydrogen atom, R5 is a methyl
group, and n=0:

[Structural formula: MeNV]

Another most preferred protecting group,
methyl-6-nitroveratryloxycarbonyl (MeNVOC), which is used to
protect the amino terminus of an amino acid, for example, is
formed when R2 and R3 are each a methoxy group, R1 and
R4 are each a hydrogen atom, R5 is a methyl group, and n=1:

[Structural formula: MeNVOC]

Another most preferred protecting group,
methyl-6-nitropiperonyl (MeNP), which is used for protecting the
carboxyl terminus of an amino acid or the hydroxyl group of
a nucleotide, for example, is formed when R2 and R3
together form a methylene acetal, R1 and R4 are each a
hydrogen atom, R5 is a methyl group, and n=0:

[Structural formula: MeNP]

Another most preferred protecting group,
methyl-6-nitropiperonyloxycarbonyl (MeNPOC), which is used to
protect the amino terminus of an amino acid, for example, is
formed when R2 and R3 together form a methylene acetal,
R1 and R4 are each a hydrogen atom, R5 is a methyl group,
and n=1:

[Structural formula: MeNPOC]

A protected amino acid having a photoactivatable
oxycarbonyl protecting group, such as NVOC or NPOC or their
corresponding methyl derivatives, MeNVOC or MeNPOC,
respectively, on the amino terminus is formed by acylating
the amine of the amino acid with an activated oxycarbonyl
ester of the protecting group. Examples of activated
oxycarbonyl esters of NVOC and MeNVOC have the general
formula:

[Structural formulas: NVOC-X and MeNVOC-X]

where X is halogen, mixed anhydride, phenoxy,
p-nitrophenoxy, N-hydroxysuccinimide, and the like.

A protected amino acid or nucleotide having a
photoactivatable protecting group, such as NV or NP or their
corresponding methyl derivatives, MeNV or MeNP,
respectively, on the carboxy terminus of the amino acid or
5'-hydroxy terminus of the nucleotide, is formed by acylating
the carboxy terminus or 5'-OH with an activated benzyl
derivative of the protecting group. Examples of activated
benzyl derivatives of MeNV and MeNP have the general
formula:

[Structural formulas: MeNV-X and MeNP-X]

where X is halogen, hydroxyl, tosyl, mesyl, trifluoromethyl,
diazo, azido, and the like.

Another method for generating protected monomers is to
react the benzylic alcohol derivative of the protecting group
with an activated ester of the monomer. For example, to
protect the carboxyl terminus of an amino acid, an activated
ester of the amino acid is reacted with the alcohol derivative
of the protecting group, such as 6-nitroveratrol (NVOH).
Examples of activated esters suitable for such uses include
halo-formate, mixed anhydride, imidazoyl formate, and acyl
halide, and also include formation of the activated ester in
situ through the use of reagents such as DCC and the like.

See Atherton et al. for other examples of activated esters.

A further method for generating protected monomers is to
react the benzylic alcohol derivative of the protecting group
with an activated carbon of the monomer. For example, to
protect the 5'-hydroxyl group of a nucleic acid, a derivative
having a 5'-activated carbon is reacted with the alcohol
derivative of the protecting group, such as
methyl-6-nitropiperonol (MePyROH). Examples of nucleotides
having activating groups attached to the 5'-hydroxyl group have
the general formula:

[Structural formula: 5'-activated nucleotide]

where Y is a halogen atom, a tosyl, mesyl, trifluoromethyl,
azido, or diazo group, and the like.

Another class of preferred photochemical protecting
groups has the formula:

[Structural formula: general pyrenylmethyl-type protecting group]

where R1, R2, and R3 independently are a hydrogen atom, a
lower alkyl, aryl, benzyl, halogen, hydroxyl, alkoxyl, thiol,
thioether, amino, nitro, carboxyl, formate, formamido,
sulfonates, sulfido or phosphido group, R4 and R5
independently are a hydrogen atom, an alkoxy, alkyl, halo, aryl,
hydrogen, or alkenyl group, and n=0 or 1.

A preferred protecting group,
1-pyrenylmethyloxycarbonyl (PyROC), which is used to
protect the amino terminus of an amino acid, for example, is
formed when R1 through R5 are each a hydrogen atom and
n=1:

[Structural formula: PyROC]


Another preferred protecting group, 1-pyrenylmethyl
(PyR), which is used for protecting the carboxy terminus of
an amino acid or the hydroxyl group of a nucleotide, for
example, is formed when R1 through R5 are each a hydrogen
atom and n=0:




An amino acid having a pyrenylmethyloxycarbonyl
protecting group on its amino terminus is formed by acylation
of the free amine of the amino acid with an activated
oxycarbonyl ester of the pyrenyl protecting group. Examples of
activated oxycarbonyl esters of PyROC have the general
formula:

[Structural formula: PyROC-X]

where X is halogen, or mixed anhydride, p-nitrophenoxy, or
N-hydroxysuccinimide group, and the like.

A protected amino acid or nucleotide having a
photoactivatable protecting group, such as PyR, on the carboxy
terminus of the amino acid or 5'-hydroxy terminus of the
nucleic acid, respectively, is formed by acylating the
carboxy terminus or 5'-OH with an activated pyrenylmethyl
derivative of the protecting group. Examples of activated
pyrenylmethyl derivatives of PyR have the general formula:

[Structural formula: PyR-X]

where X is a halogen atom, a hydroxyl, diazo, or azido
group, and the like.

Another method of generating protected monomers is to
react the pyrenylmethyl alcohol moiety of the protecting
group with an activated ester of the monomer. For example,
an activated ester of an amino acid can be reacted with the
alcohol derivative of the protecting group, such as
pyrenylmethyl alcohol (PyROH), to form the protected derivative of
the carboxy terminus of the amino acid. Examples of
activated esters include halo-formate, mixed anhydride,
imidazoyl formate, and acyl halide, and also include formation of the
activated ester in situ and the use of common reagents such
as DCC and the like.

Clearly, many photosensitive protecting groups are suit- 
able for use in the present invention. 

In preferred embodiments, the substrate is irradiated to
remove the photoremovable protecting groups and create
regions having free reactive moieties and side products
resulting from the protecting group. The removal rate of the
protecting groups depends on the wavelength and intensity
of the incident radiation, as well as the physical and
chemical properties of the protecting group itself. Preferred
protecting groups are removed at a faster rate and with a lower
intensity of radiation. For example, at a given set of
conditions, MeNVOC and MeNPOC are photolytically
removed from the N-terminus of a peptide chain faster than
their unsubstituted parent compounds, NVOC and NPOC,
respectively.

Removal of the protecting group is accomplished by
irradiation to liberate the reactive group and degradation
products derived from the protecting group. Not wishing to
be bound by theory, it is believed that irradiation of
NVOC- and MeNVOC-protected oligomers proceeds by the
following reaction schemes:

NVOC-AA → 3,4-dimethoxy-6-nitrosobenzaldehyde + CO2 + AA

MeNVOC-AA → 3,4-dimethoxy-6-nitrosoacetophenone + CO2 + AA

where AA represents the N-terminus of the amino acid
oligomer.

Along with the unprotected amino acid, other products are
liberated into solution: carbon dioxide and a
2,3-dimethoxy-6-nitrosophenylcarbonyl compound, which can react with
nucleophilic portions of the oligomer to form unwanted
secondary reactions. In the case of an NVOC-protected
amino acid, the degradation product is a
nitrosobenzaldehyde, while the degradation product for the
other is a nitrosophenyl ketone. For instance, it is believed
that the product aldehyde from NVOC degradation reacts
with free amines to form a Schiff base (imine) that affects the
remaining polymer synthesis. Preferred photoremovable
protecting groups react slowly or reversibly with the
oligomer on the support.

Again not wishing to be bound by theory, it is believed
that the product ketone from irradiation of a
MeNVOC-protected oligomer reacts at a slower rate with nucleophiles
on the oligomer than the product aldehyde from irradiation
of the same NVOC-protected oligomer. Although not
unambiguously determined, it is believed that this difference in
reaction rate is due to the difference in general reactivity
between aldehydes and ketones towards nucleophiles due to
steric and electronic effects.

The photoremovable protecting groups of the present
invention are readily removed. For example, the photolysis
of N-protected L-phenylalanine in solution and having
different photoremovable protecting groups was analyzed, and
the results are presented in the following table:

TABLE

Photolysis of Protected L-Phe-OH

                        t1/2 in seconds
Solvent                 NBOC    NVOC    MeNVOC    MeNPOC
Dioxane                 1288    110     24        19
5 mM H2SO4/Dioxane      1575    98      33        22


The half life, t1/2, is the time in seconds required to
remove 50% of the starting amount of protecting group.
NBOC is the 6-nitrobenzyloxycarbonyl group, NVOC is the
6-nitroveratryloxycarbonyl group, MeNVOC is the
methyl-6-nitroveratryloxycarbonyl group, and MeNPOC is the
methyl-6-nitropiperonyloxycarbonyl group. The photolysis
was carried out in the indicated solvent with 362/364
nm-wavelength irradiation having an intensity of 10
mW/cm2, and the concentration of each protected
phenylalanine was 0.10 mM.

The table shows that deprotection of NVOC-, MeNVOC-,
and MeNPOC-protected phenylalanine proceeded faster
than the deprotection of NBOC. Furthermore, it shows that
the two derivatives that are substituted on the benzylic
carbon, MeNVOC and MeNPOC, were photolyzed at the
highest rates in both dioxane and acidified dioxane.
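
Merely by way of illustration, the half lives reported above can be converted into exposure times. The Python sketch below assumes that photolysis follows simple first-order kinetics at fixed intensity, so that the fraction of protecting group remaining after time t is 0.5^(t/t1/2); that kinetic model is an assumption of the sketch and is not stated in the text. The half lives are those read from the dioxane row of the table, and the function names are hypothetical.

```python
import math

# Half lives in seconds for photolysis in dioxane, as read from the table.
HALF_LIFE_S = {"NBOC": 1288.0, "NVOC": 110.0, "MeNVOC": 24.0, "MeNPOC": 19.0}


def fraction_remaining(t_seconds, half_life_s):
    """Fraction of protecting group still attached after t_seconds,
    assuming first-order decay (an assumption of this sketch)."""
    return 0.5 ** (t_seconds / half_life_s)


def exposure_time(target_removed, half_life_s):
    """Exposure time in seconds needed to remove the given fraction (0 < f < 1)."""
    if not 0.0 < target_removed < 1.0:
        raise ValueError("target_removed must be between 0 and 1")
    # Number of half lives needed, times the length of one half life.
    return half_life_s * math.log(1.0 / (1.0 - target_removed), 2)


if __name__ == "__main__":
    for name, t_half in HALF_LIFE_S.items():
        print(f"{name}: {exposure_time(0.99, t_half):.0f} s for 99% removal")
```

Under this assumed model, the exposure needed for a given degree of deprotection scales linearly with the half life, which is why the groups substituted on the benzylic carbon are preferred.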

1. Use of Photoremovable Groups During Solid-Phase
Synthesis of Peptides

The formation of peptides on a solid-phase support
requires the stepwise attachment of an amino acid to a
substrate-bound growing chain. In order to prevent
unwanted polymerization of the monomeric amino acid
under the reaction conditions, protection of the amino
terminus of the amino acid is required. After the monomer is
coupled to the end of the peptide, the N-terminal protecting
group is removed, and another amino acid is coupled to the
chain. This cycle of coupling and deprotecting is continued
for each amino acid in the peptide sequence. See Merrifield,
J. Am. Chem. Soc. (1963) 85:2149, and Atherton et al.,
"Solid Phase Peptide Synthesis" 1989, IRL Press, London,
both incorporated herein by reference for all purposes. As
described above, the use of a photoremovable protecting
group allows removal of selected portions of the substrate
surface, via patterned irradiation, during the deprotection
cycle of the solid phase synthesis. This selectively allows
spatial control of the synthesis: the next amino acid is
coupled only to the irradiated areas.

In one embodiment, the photoremovable protecting
groups of the present invention are attached to an activated
ester of an amino acid at the amino terminus:

[Structural formula: N-protected activated amino acid]

where R is the side chain of a natural or unnatural amino
acid, X is a photoremovable protecting group, and Y is an
activated carboxylic acid derivative. The photoremovable
protecting group, X, is preferably NVOC, NPOC, PyROC,
MeNVOC, MeNPOC and the like as discussed above. The
activated ester, Y, is preferably a reactive derivative having
a high coupling efficiency, such as an acyl halide, mixed
anhydride, N-hydroxysuccinimide ester, perfluorophenyl
ester, or urethane protected acid, and the like. Other
activated esters and reaction conditions are well known (See
Atherton et al.).

2. Use of Photoremovable Groups During Solid-Phase
Synthesis of Oligonucleotides

The formation of oligonucleotides on a solid-phase
support requires the stepwise attachment of a nucleotide to a
substrate-bound growing oligomer. In order to prevent
unwanted polymerization of the monomeric nucleotide
under the reaction conditions, protection of the 5'-hydroxyl
group of the nucleotide is required. After the monomer is
coupled to the end of the oligomer, the 5'-hydroxyl
protecting group is removed, and another nucleotide is coupled to
the chain. This cycle of coupling and deprotecting is
continued for each nucleotide in the oligomer sequence. See
Gait, "Oligonucleotide Synthesis: A Practical Approach"
1984, IRL Press, London, incorporated herein by reference
for all purposes. As described above, the use of a
photoremovable protecting group allows removal, via patterned
irradiation, of selected portions of the substrate surface
during the deprotection cycle of the solid phase synthesis.
This selectively allows spatial control of the synthesis: the
next nucleotide is coupled only to the irradiated areas.

Oligonucleotide synthesis generally involves coupling an
activated phosphorous derivative on the 3'-hydroxyl group
of a nucleotide with the 5'-hydroxyl group of an oligomer
bound to a solid support. Two major chemical methods exist
to perform this coupling: the phosphate-triester and
phosphoramidite methods (See Gait). Protecting groups of the
present invention are suitable for use in either method.

In a preferred embodiment, a photoremovable protecting
group is attached to an activated nucleotide on the
5'-hydroxyl group:

[Structural formula: 5'-protected activated nucleotide]

where B is the base attached to the sugar ring; R is a
hydrogen atom when the sugar is deoxyribose or R is a
hydroxyl group when the sugar is ribose; P represents an
activated phosphorous group; and X is a photoremovable
protecting group. The photoremovable protecting group, X,
is preferably NV, NP, PyR, MeNV, MeNP, and the like as
described above. The activated phosphorous group, P, is
preferably a reactive derivative having a high coupling
efficiency, such as a phosphate-triester, phosphoramidite or
the like. Other activated phosphorous derivatives, as well as
reaction conditions, are well known (See Gait).

E. Amino Acid N-Carboxy Anhydrides Protected With a
Photoremovable Group

During Merrifield peptide synthesis, an activated ester of
one amino acid is coupled with the free amino terminus of
a substrate-bound oligomer. Activated esters of amino acids
suitable for the solid phase synthesis include halo-formate,
mixed anhydride, imidazoyl formate, and acyl halide, and also
include formation of the activated ester in situ and the use
of common reagents such as DCC and the like (See Atherton
et al.). A preferred protected and activated amino acid has
the general formula:

[Structural formula: urethane-protected amino acid N-carboxy anhydride]

where R is the side chain of the amino acid and X is a
photoremovable protecting group. This compound is a
urethane-protected amino acid having a photoremovable
protecting group attached to the amine. A more preferred
activated amino acid is formed when the photoremovable
protecting group has the general formula:

[Structural formula: general nitrobenzyloxycarbonyl-type protecting group]

where R1, R2, R3, and R4 independently are a hydrogen
atom, a lower alkyl, aryl, benzyl, halogen, hydroxyl,
alkoxyl, thiol, thioether, amino, nitro, carboxyl, formate,
formamido or phosphido group, or adjacent substituents
(i.e., R1-R2, R2-R3, R3-R4) are substituted oxygen groups
that together form a cyclic acetal or ketal; and R5 is a
hydrogen atom, an alkoxyl, alkyl, hydrogen, halo, aryl, or
alkenyl group.

A preferred activated amino acid is formed when the
photoremovable protecting group is
6-nitroveratryloxycarbonyl. That is, R1 and R4 are each a
hydrogen atom, R2 and R3 are each a methoxy group, and
R5 is a hydrogen atom. Another preferred activated amino
acid is formed when the photoremovable group is
6-nitropiperonyl: R1 and R4 are each a hydrogen atom, R2
and R3 together form a methylene acetal, and R5 is a
hydrogen atom. Other protecting groups are possible.
Another preferred activated ester is formed when the
photoremovable group is methyl-6-nitroveratryl or
methyl-6-nitropiperonyl.

Another preferred activated amino acid is formed when
the photoremovable protecting group has the general
formula:

[Structural formula: general pyrenylmethyloxycarbonyl-type protecting group]

where R1, R2, and R3 independently are a hydrogen atom, a
lower alkyl, aryl, benzyl, halogen, hydroxyl, alkoxyl, thiol,
thioether, amino, nitro, carboxyl, formate, formamido,
sulfonates, sulfido or phosphido group, and R4 and R5
independently are a hydrogen atom, an alkoxy, alkyl, halo,
aryl, hydrogen, or alkenyl group. The resulting compound is
a urethane-protected amino acid having a
pyrenylmethyloxycarbonyl protecting group attached to the amine. A more
preferred embodiment is formed when R1 through R5 are
each a hydrogen atom.

The urethane-protected amino acids having a
photoremovable protecting group of the present invention are
prepared by condensation of an N-protected amino acid with an
acylating agent such as an acyl halide, anhydride,
chloroformate and the like (See Fuller et al., U.S. Pat. No.
4,946,542 and Fuller et al., J. Amer. Chem. Soc. (1990)
112:7414-7416, both herein incorporated by reference for
all purposes).

Urethane-protected amino acids having photoremovable
protecting groups are generally useful as reagents during
solid-phase peptide synthesis, and because of the spatial
selectivity possible with the photoremovable protecting
group, are especially useful for spatially addressable
peptide synthesis. These amino acids are difunctional: the
urethane group first serves to activate the carboxy terminus
for reaction with the amine bound to the surface and, once
the peptide bond is formed, the photoremovable protecting
group protects the newly formed amino terminus from
further reaction. These amino acids are also highly reactive
to nucleophiles, such as deprotected amines on the surface
of the solid support, and due to this high reactivity, the
solid-phase peptide coupling times are significantly reduced,
and yields are typically higher.

IV. Data Collection
A. Data Collection System

Substrates prepared in accordance with the above
description are used in one embodiment to determine which of the
plurality of sequences thereon bind to a receptor of interest.
FIG. 11 illustrates one embodiment of a device used to
detect regions of a substrate which contain fluorescent
markers. This device would be used, for example, to detect
the presence or absence of a labeled receptor such as an
antibody which has bound to a synthesized polymer on a
substrate.

Light is directed at the substrate from a light source 1002
such as a laser light source of the type well known to those
of skill in the art such as a model no. 2025 made by Spectra
Physics. Light from the source is directed at a lens 1004
which is preferably a cylindrical lens of the type well known
to those of skill in the art. The resulting output from the lens
1004 is a linear beam rather than a spot of light, resulting in
the capability to detect data substantially simultaneously
along a linear array of pixels rather than on a pixel-by-pixel
basis. It will be understood that a cylindrical lens is used
herein as an illustration of one technique for generating a
linear beam of light on a surface, but that other techniques
could also be utilized.

The beam from the cylindrical lens is passed through a
dichroic mirror or prism (1006) and directed at the surface
of the suitably prepared substrate 1008. Substrate 1008 is
placed on an x-y translation stage 1009 such as a model no.
PM500-8 made by Newport. Light at certain locations on the
substrate will be fluoresced and transmitted along the path
indicated by dashed lines back through the dichroic mirror,
and focused with a suitable lens 1010 such as an f/1.4
camera lens on a linear detector 1012 via a variable f stop
focusing lens 1014. Through use of a linear light beam, it
becomes possible to generate data over a line of pixels (such
as about 1 cm) along the substrate, rather than from
individual points on the substrate. In alternative embodiments,
light is directed at a 2-dimensional area of the substrate and
fluoresced light detected by a 2-dimensional CCD array.
Linear detection is preferred because substantially higher
power densities are obtained.

Detector 1012 detects the amount of light fluoresced from
the substrate as a function of position. According to one
embodiment the detector is a linear CCD array of the type
commonly known to those of skill in the art. The x-y
translation stage, the light source, and the detector 1012 are
all operably connected to a computer 1016 such as an IBM
PC-AT or equivalent for control of the device and data
collection from the CCD array.

In operation, the substrate is appropriately positioned by 
the translation stage. The light source is then illuminated 
and intensity data are gathered with the computer via the 



FIG. 12 illustrates the architecture of the data collection
system in greater detail. Operation of the system occurs
under the direction of the photon counting program 1102
(photon), included herewith as Appendix B. The user inputs
the scan dimensions, the number of pixels or data points in
a region, and the scan speed to the counting program. Via a
GPIB bus 1104 the program (in an IBM PC compatible
computer, for example) interfaces with a multichannel scaler
1106 such as a Stanford Research SR430 and an x-y stage
controller 1108 such as a PM500. The signal from the light
from the fluorescing substrate enters a photon counter 1110,
providing output to the scaler 1106. Data are output from the
scaler indicative of the number of counts in a given region.
After scanning a selected area, the stage controller is
activated with commands for acceleration and velocity, which in
turn drives the scan stage 1112 such as a PM500-A to
another region.

Data are collected in an image data file 1114 and
processed in a scaling program 1116, also included in Appendix
B. A scaled image is output for display on, for example, a
VGA display 1118. The image is scaled based on an input of
the percentage of pixels to clip and the minimum and
maximum pixel levels to be viewed. The system outputs for
use the min and max pixel levels in the raw data.
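
Merely by way of illustration, the scaling step described above might be sketched in Python as follows. This is not the scaling program of Appendix B; the function name `scale_image`, the 256-level display range, and the choice to let the brightest pixels saturate when a clip percentage is given are all assumptions of the sketch.

```python
def scale_image(counts, clip_percent=1.0, lo=None, hi=None, levels=256):
    """Map raw photon counts to display levels 0..levels-1.

    counts       -- flat list of raw counts, one per pixel
    clip_percent -- percentage of the brightest pixels allowed to saturate
    lo, hi       -- optional explicit display window; derived from data if None
    Returns (scaled_pixels, (raw_min, raw_max)).
    """
    if not counts:
        return [], (None, None)
    ordered = sorted(counts)
    raw_min, raw_max = ordered[0], ordered[-1]
    if lo is None:
        lo = raw_min
    if hi is None:
        # Number of pixels that are NOT clipped; the window top is the
        # brightest of those.
        keep = int(len(ordered) * (100.0 - clip_percent) / 100.0)
        keep = max(1, min(len(ordered), keep))
        hi = ordered[keep - 1]
    span = max(hi - lo, 1)  # avoid division by zero on a flat image
    scaled = []
    for c in counts:
        c = min(max(c, lo), hi)  # clamp into the display window
        scaled.append((c - lo) * (levels - 1) // span)
    return scaled, (raw_min, raw_max)
```

For example, `scale_image([0, 10, 20, 30, 1000], clip_percent=20)` lets the single outlier saturate, so the remaining four pixels use the full display range, while the raw minimum and maximum (0 and 1000) are still reported.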
B. Data Analysis 

The output from the data collection system is an array of
data indicative of fluorescent intensity versus location on the
substrate. The data are typically taken over regions
substantially smaller than the area in which synthesis of a given
polymer has taken place. Merely by way of example, if
polymers were synthesized in squares on the substrate
having dimensions of 500 microns by 500 microns, the data
may be taken over regions having dimensions of 5 microns
by 5 microns. In most preferred embodiments, the regions
over which fluorescence data are taken across the substrate
are less than about 1/2 the area of the regions in which
individual polymers are synthesized, preferably less than 1/10
the area in which a single polymer is synthesized, and most
preferably less than 1/100 the area in which a single polymer
is synthesized. Hence, within any area in which a given
polymer has been synthesized, a large number of
fluorescence data points are collected.
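
The arithmetic behind the example dimensions above can be checked with a short Python sketch. The function names are hypothetical, and square regions that tile the synthesis square exactly are assumed.

```python
def points_per_cell(cell_side_um, pixel_side_um):
    """Number of whole data regions (pixels) that tile one square synthesis cell."""
    per_side = int(cell_side_um // pixel_side_um)
    return per_side * per_side


def area_ratio(cell_side_um, pixel_side_um):
    """Area of one data region divided by the area of one synthesis cell."""
    return (pixel_side_um ** 2) / (cell_side_um ** 2)


n = points_per_cell(500, 5)  # 100 pixels per side
r = area_ratio(500, 5)       # 25 square microns / 250,000 square microns
print(n, r)
```

With 500 micron synthesis squares and 5 micron data regions, each synthesis square yields 10,000 data points, and each data region covers 1/10,000 of the square, well under the most preferred 1/100 limit.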

A plot of number of pixels versus intensity for a scan of
a cell when it has been exposed to, for example, a labeled
antibody will typically take the form of a bell curve, but
spurious data are observed, particularly at higher intensities.
Since it is desirable to use an average of fluorescent intensity
over a given synthesis region in determining relative binding
affinity, these spurious data will tend to undesirably skew the
data.

Accordingly, in one embodiment of the invention the data
are corrected for removal of these spurious data points, and
an average of the data points is thereafter utilized in
determining relative binding efficiency.

FIG. 13 illustrates one embodiment of a system for
removal of spurious data from a set of fluorescence data such
as data used in affinity screening studies. A user or the
system inputs data relating to the chip location and cell
corners at step 1302. From this information and the image
file, the system creates a computer representation of a
histogram at step 1304, the histogram (at least in the form of
a computer file) plotting number of data pixels versus
intensity.

For each cell, a main data analysis loop is then performed.
For each cell, at step 1306, the system calculates the total
intensity or number of pixels for the bandwidth centered
around varying intensity levels. For example, as shown in
the plot to the right of step 1306, the system calculates the
number of pixels within the band of width w. The system
then "moves" this bandwidth to a higher center intensity, and
again calculates the number of pixels in the bandwidth. This
process is repeated until the entire range of intensities has
been scanned, and at step 1308 the system determines which
band has the highest total number of pixels. The data within
this bandwidth are used for further analysis. Assuming the
bandwidth is selected to be reasonably small, this procedure
will have the effect of eliminating spurious data located at
the higher intensity levels. At step 1310 the system then
checks whether all cells have been evaluated and, if not,
repeats the loop for the next cell.

At step 1312 the system then integrates the data within the
bandwidth for each of the selected cells, sorts the data at step
1314 using the synthesis procedure file, and displays the data
to a user on, for example, a video display or a printer.
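
Merely by way of illustration, the band-selection loop of steps 1306 and 1308 and the averaging that follows can be sketched in Python for a single cell. This is not the program of Appendix B: the function names, the `step` parameter, and the tie-breaking rule are assumptions, and the window is described by its lower edge, which is equivalent to a band of width w centered at that edge plus w/2.

```python
def densest_band(intensities, width, step=1):
    """Return (band_low, pixels_in_band) for the intensity window
    [band_low, band_low + width) that contains the most pixels.
    Ties go to the lowest window (an assumption of this sketch)."""
    if not intensities:
        raise ValueError("no pixel data")
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    data = sorted(intensities)
    best_low, best = data[0], []
    low = data[0]
    # Slide the window upward until the whole intensity range is scanned.
    while low <= data[-1]:
        band = [v for v in data if low <= v < low + width]
        if len(band) > len(best):
            best_low, best = low, band
        low += step
    return best_low, best


def cell_intensity(intensities, width, step=1):
    """Mean intensity of the pixels inside the densest band."""
    _, band = densest_band(intensities, width, step)
    return sum(band) / len(band)
```

For a cell with pixel intensities `[98, 100, 101, 102, 103, 105, 400, 950]` and a band width of 10, the densest band holds the six pixels near 100, so the two bright outliers no longer skew the average for the cell.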

V. Representative Applications
A. Oligonucleotide Synthesis
The generality of light directed spatially addressable
parallel chemical synthesis is demonstrated by application to
nucleic acid synthesis.
1. Example

Light activated formation of a thymidine-cytidine dimer
was carried out. A three dimensional representation of a
fluorescence scan showing a checkerboard pattern generated
by the light-directed synthesis of a dinucleotide is shown in
FIG. 14. 5'-nitroveratryl thymidine was attached to a
synthesis substrate through the 3' hydroxyl group. The nitroveratryl
protecting groups were removed by illumination through a
500 μm checkerboard mask. The substrate was then treated
with phosphoramidite activated 2'-deoxycytidine. In order to
follow the reaction fluorometrically, the deoxycytidine had
been modified with an FMOC protected aminohexyl linker
attached to the exocyclic amine
(5'-O-dimethoxytrityl-4-N-(6-N-fluorenylmethylcarbamoyl-hexylcarboxy)-2'-deoxycytidine).
After removal of the FMOC protecting
group with base, the regions which contained the
dinucleotide were fluorescently labelled by treatment of the
substrate with 1 mM FITC in DMF for one hour.
The three-dimensional representation of the fluorescent
intensity data in FIG. 14 clearly reproduces the
checkerboard illumination pattern used during photolysis of the
substrate. This result demonstrates that oligonucleotides as
well as peptides can be synthesized by the light-directed
method.

VI. Conclusion

The inventions herein provide a new approach for the
simultaneous synthesis of a large number of compounds.
The method can be applied whenever one has chemical
building blocks that can be coupled in a solid-phase format,
and when light can be used to generate a reactive group.

The above description is illustrative and not restrictive.
Many variations of the invention will become apparent to
those of skill in the art upon review of this disclosure.
Merely by way of example, while the invention is illustrated
primarily with regard to peptide and nucleotide synthesis,
the invention is not so limited. The scope of the invention
should, therefore, be determined not with reference to the
above description, but instead should be determined with
reference to the appended claims along with their full scope
of equivalents.



( i ) 



( i i i > 



( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 5 amino acids
( B ) TYPE: amino acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE:
( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

Tyr Gly Gly Phe Leu
1               5



( 2 JINPORMXnONPORSEQtDNOa: 

( i )S80JUEX£CBAJtACIERISDCS: 
( A ) LENQTB: 5 ohd «ad 
( B ) TTPE: nw acid 

<C) 

( D ) TOPOLOGY: 



( i i ) 

( i i ) 5BQUEXCB DESCRIPTION: SEQ ID NOO: 

Pro O I y O 1 7 Pbi L • • 

1 3 



( 2 ) INFORMATION FOR SBQ ID N03: 

( i )S8QXIE>CE CHARACTERISTICS 
( A )l£NOIH:5 
( B )TT7£:«m» 

(C) 
(0) 



( i i >: 

( i i ) SEQUENCE DBSCSimON: SBQ ID NOJ: 
Tyt 01 y Ala O 1 y P k « 

( 2 ) INFORMATION POR S8Q ID NO:4: 

( i ) 



( A ) LENGTH: 4 
( B )TYPE:ama 

I C ) 
(D) 



( i i ) MOLECULE 
( i i ) SEQUENCE DBSCRjynOH: SBQ ID NOA 

Tyi 01 j All P ■ • L • i Sir 

1 5 



( 2 ) INFORMATION FOR SBQ ID N0t5: 

( i ) SBOJJENCE CHARACTERISTICS: 
( A ) LENGTH: 3 
( B ) TYPE: ana 
( C ) 
(D) 



(ii) 

( x i )S8qC»CIDE3CRlPT10ri:SBQIDHCh5: 

T y i 0 1 y Ala Phi S • r 

t 5 



( 2 ) INFORMATION FOR SEQ ID NO:6:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:
        Tyr Gly Ala Phe Leu
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:7:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 6 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:
        Tyr Gly Gly Phe Leu Ser
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:8:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 4 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8:
        Tyr Gly Ala Phe
        1

( 2 ) INFORMATION FOR SEQ ID NO:9:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:
        Tyr Gly Ala Leu Ser
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:10:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:
        Tyr Gly Gly [illegible] Ser
        1

( 2 ) INFORMATION FOR SEQ ID NO:11:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 4 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:
        Tyr Gly Ala Leu
        1

( 2 ) INFORMATION FOR SEQ ID NO:12:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 6 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:12:
        Tyr Gly Ala Phe Leu Phe
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:13:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:13:
        Tyr Gly Ala Phe Phe
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:14:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:14:
        Tyr Gly Gly Leu Ser
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:15:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:15:
        Tyr Gly Gly Phe Leu
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:16:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 6 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:16:
        Tyr Gly Ala Phe Ser Phe
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:17:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 7 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:17:
        Tyr Gly Ala Phe Leu Ser Phe
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:18:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 6 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:18:
        Tyr Gly Ala Phe Met Gln
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:19:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:19:
        Tyr Gly Ala Phe Met
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:20:
    ( i ) SEQUENCE CHARACTERISTICS:
        ( A ) LENGTH: 5 amino acids
        ( B ) TYPE: amino acid
        ( C ) STRANDEDNESS: single
        ( D ) TOPOLOGY: linear
    ( i i ) MOLECULE TYPE: peptide
    ( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:20:
        Tyr Gly Ala Phe Gln
        1               5

( 2 ) INFORMATION FOR SEQ ID NO:21:
    ( i ) SEQUENCE CHARACTERISTICS:
FIG. 5
FIG. 10
FIG. 11
FIG. 13
Fig. 23A (panels: TARGET: WT 12 MER; TARGET: T SUBSTITUTION IN POSITION 7; horizontal axis: 12 MER PROBES)

U.S. Patent Nov. 17, 1998 Sheet 29 of 40 5,837,832

Fig. 23B (panels: TARGET: C SUBSTITUTION IN POSITION 7; TARGET: A SUBSTITUTION IN POSITION 7; horizontal axis: 12 MER PROBES)
FIG. 24 (4:1 Mixture of WT and "A" Substitution 12-mer Targets)
Fig. 25A (panels: TARGET: WT 12 MER; TARGET: T SUBSTITUTION IN POSITION 7; horizontal axis: 10 MER PROBES)
Fig. 25B
ARRAYS OF NUCLEIC ACID PROBES ON
BIOLOGICAL CHIPS

CROSS-REFERENCE TO RELATED 
APPLICATION 

This is a Continuation of application Ser. No. 08/143,312,
filed Oct. 26, 1993, now abandoned, which is a continuation
in part of U.S. patent application Ser. No. 082,937, filed
Jun. 25, 1993, now abandoned, incorporated herein by
reference.

Research leading to the invention was funded in part by 
NIH grant No. 1R01HG00813-01 and DOE grant No. 
DE-FG03-92-ER81275, and the government may have cer- 
tain rights to the invention. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention provides arrays of oligonucleotide 
probes immobilized in microfabricated patterns on silica 
chips for analyzing molecular interactions of biological 
interest. The invention therefore relates to diverse fields 
impacted by the nature of molecular interaction, including 
chemistry, biology, medicine, and medical diagnostics. 

2. Description of Related Art 

Oligonucleotide probes have long been used to detect 
complementary nucleic acid sequences in a nucleic acid of 
interest (the "target" nucleic acid). In some assay formats, 
the oligonucleotide probe is tethered, i.e., by covalent 
attachment, to a solid support, and arrays of oligonucleotide 
probes immobilized on solid supports have been used to 
detect specific nucleic acid sequences in a target nucleic
acid. See, e.g., PCT patent publication Nos. WO 89/10977
and 89/11548. Others have proposed the use of large numbers
of oligonucleotide probes to provide the complete
nucleic acid sequence of a target nucleic acid but failed to
provide an enabling method for using arrays of immobilized
probes for this purpose. See U.S. Pat. Nos. 5,202,231 and 
5,002,867 and PCT patent publication No. WO 93/17126. 

The development of VLSIPS™ technology has provided
methods for making very large arrays of oligonucleotide
probes in very small areas. See U.S. Pat. No. 5,143,854 and
PCT patent publication Nos. WO 90/15070 and 92/10092, 
each of which is incorporated herein by reference. U.S. 
patent application Ser. No. 082,937, filed Jun. 25, 1993, 
describes methods for making arrays of oligonucleotide 
probes that can be used to provide the complete sequence of 
a target nucleic acid and to detect the presence of a nucleic 
acid containing a specific nucleotide sequence. 

Microfabricated arrays of large numbers of oligonucleotide
probes, called "DNA chips," offer great promise for a
wide variety of applications. New methods and reagents are
required to realize this promise, and the present invention 
helps meet that need. 






SUMMARY OF THE INVENTION 

The present invention provides methods for making high- 
density arrays of oligonucleotide probes on silica chips and 
for using those probe arrays to detect specific nucleic acid 
sequences contained in a target nucleic acid in a sample. The 
invention also provides arrays of oligonucleotide probes on 
DNA chips, in which the probes have specific sequences and 
locations in the array to facilitate identification of a specific 
target nucleic acid. In another aspect, the invention provides 
methods for detecting whether one or more specific 
sequences of a target nucleic acid in a sample vary from a
previously characterized sequence or reference sequence.
The methods of the invention can be used to detect varia- 
tions between a target and reference sequence, including 
single or multiple base substitutions, and deletions and 
insertions of bases, as well as detecting the presence, 
location, and sequence of other more complex variations 
between a target and reference sequence in a nucleic acid. 

The present invention provides arrays of oligonucleotide 
probes immobilized on a solid support. The arrays are
preferably synthesized directly on the support using 
VLSIPS™ technology, but other synthesis methods and 
immobilization of pre-synthesized oligonucleotide probes 
can be used to make the oligonucleotide probe arrays, called 
"DNA chips", of the invention. In general, these arrays 
comprise a set of oligonucleotide probes such that, for each 
base in a specific reference sequence, the set includes a 
probe (called the "wild-type" or "WT" probe) that is exactly 
complementary to a section of the reference sequence 
including the base of interest and four additional probes 
(called "substitution probes"), which are identical to the WT 
probe except that the base of interest has been replaced by 
one of a predetermined set (typically 4) of nucleotides. In the 
preferred embodiment, one of the four substitution probes is 
identical to the wild type probe; the other three are comple- 
mentary to targets that have a single-base substitution at this 
position. 

In another aspect, the invention relates to the arrangement 
of individual probes in the array. In one embodiment, the 
probes are arranged on the chip so that probes for a given 
position in the sequence are adjacent, and probes for adja- 
cent positions in the reference sequence are also adjacent to 
one another on the chip. One method arranges the probes for
a single base in a short column (alternatively, a row) and
arranges the columns in the order of the base position to
form horizontal (alternatively, vertical) stripes. The wild-type
and each of the substitution probes have specified positions 
within the column so that all the probes corresponding to an 
A substitution, for example, are in a single row. The stripes 
may be separated on the chip by a blank row or column. 
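
The striped arrangement described above can be sketched as a simple addressing rule. This is only an illustration under stated assumptions: the function name `cell_for`, the lane order (WT, A, C, G, T), the 0-based indexing, and the stripe width are hypothetical choices, not details taken from the specification.

```python
# Hypothetical addressing helper for a striped probe layout.
# Assumed lane order within each column: WT on top, then A, C, G, T.
LANES = ("WT", "A", "C", "G", "T")


def cell_for(position, lane, columns_per_stripe, blank_rows=1):
    """Return (row, column) of the probe for a 0-based reference position.

    Probes for one base position share a column; columns follow base order
    within a horizontal stripe; stripes are separated by blank rows.
    """
    stripe, column = divmod(position, columns_per_stripe)
    row = stripe * (len(LANES) + blank_rows) + LANES.index(lane)
    return row, column


if __name__ == "__main__":
    # Positions 0 and 1 sit side by side; position 10 starts the next stripe.
    for pos in (0, 1, 10):
        cells = [cell_for(pos, lane, columns_per_stripe=10) for lane in LANES]
        print(pos, cells)
```

With this rule, every A-substitution probe in a stripe lies in one row, and probes for adjacent reference positions lie in adjacent columns, which is the adjacency property the text describes.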

The DNA chips of the invention can be made in a wide 
number of variations. For some applications, leaving out the 
wild-type row, leaving out unimportant bases, pooling bases, 
including insertion and deletion probes, varying the length 
of the probes within a set to make the probes have the same 
or similar Tm relative to the target or to avoid secondary 
structure, varying the mutation position, using multiple 
probes for a single mutation, providing replicate probes or 
arrays, placing blank "streets" (no probe) between rows, 
columns, or individual probes, and using control probes may 
be appropriate. 

The present invention also provides DNA chips for detect- 
ing mutations associated with cystic fibrosis, including 
mutations in exons 4, 7, 9, 10, 11, 20, and 21 of the CFTR 
gene. The invention also provides DNA chips for detecting 
mutations in the p53 gene, a gene in which mutations are 
known to be associated with a wide variety of cancers. Other 
DNA chips of the invention provide probe arrays for detect- 
ing specific sequences of mitochondrial DNA, useful for 
identification and forensic purposes. The invention also 
provides DNA chips for detecting specific sequences of 
nucleotides or mutations associated with the acquisition of a 
drug resistant phenotype in an infectious organism, such as 
rifampicin or other drug resistant TB strains and HIV, in 
which mutations in an RNA polymerase gene are known to 
give rise to drug resistance. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows how the tiling method of the invention
defines a set of DNA probes relative to a target nucleic acid.
In the figure, the target is a DNA molecule, the probes are
single-stranded nucleic acids 16 nucleotides in length, and
only a portion of the probes defined by the method is shown.

FIG. 2 shows an illustrative tiled array of the invention
with probes for the detection of point mutations. The base at
the position of substitution in each of the wild-type probes
is shown in the wild-type lane, and the shading shows the
location of the substitution probe having the wild-type
sequence. The SEQ ID. NOS. corresponding to the two
peptide sequences shown in the top portion of FIG. 2 are 311
and 312, respectively. The SEQ ID. NOS. corresponding to
the five peptide sequences listed at the bottom of FIG. 2 are
313, 314, 315, 313, and 316, respectively.

FIG. 3, in panels A, B, and C, shows an image made from
the region of a DNA chip containing CFTR exon 10 probes;
in panel A, the chip was hybridized to a wild-type target; in
panel C, the chip was hybridized to a mutant ΔF508 target;
and in panel B, the chip was hybridized to a mixture of the
wild-type and mutant targets. The SEQ ID. NOS. corresponding
to the four peptide sequences shown in FIG. 3 are
317-320, respectively.

FIG. 4, in sheets 1-3, corresponding to panels A, B, and
C of FIG. 3, shows graphs of fluorescence intensity versus
tiling position. The labels on the horizontal axis show the
bases in the wild-type sequence corresponding to the position
of substitution in the respective probes. Plotted are the
intensities observed from the features (or synthesis sites)
containing wild-type probes, the features containing the
substitution probes that bound the most target ("called"), and
the feature containing the substitution probes that bound the
target with the second highest intensity of all the substitution
probes ("2nd Highest"). The SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 1 of FIG. 4 are
321 and 318, respectively; the SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 2 of FIG. 4 are
322 and 318, respectively; and the SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 3 of
FIG. 4 are 323 and 318, respectively.

FIG. 5, in panels A, B, and C, shows an image made from
a region of a DNA chip containing CFTR exon 10 probes;
in panel A, the chip was hybridized to the wt480 target; in
panel C, the chip was hybridized to the mu480 target; and in
panel B, the chip was hybridized to a mixture of the
wild-type and mutant targets. The SEQ ID. NOS. corresponding
to the peptide sequences shown in FIG. 5 are
324-327, respectively.

FIG. 6, in sheets 1-3, corresponding to panels A, B, and
C of FIG. 5, shows graphs of fluorescence intensity versus
tiling position. The labels on the horizontal axis show the
bases in the wild-type sequence corresponding to the position
of substitution in the respective probes. Plotted are the
intensities observed from the features (or synthesis sites)
containing wild-type probes, the features containing the
substitution probes that bound the most target ("called"), and
the feature containing the substitution probes that bound the
target with the second highest intensity of all the substitution
probes ("2nd Highest"). The SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 1 of FIG. 6 are
328 and 329, respectively; the SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 2 of FIG. 6 are
330 and 329, respectively; and the SEQ ID. NOS. corresponding
to the two peptide sequences shown in sheet 3 of
FIG. 6 are 331 and 329, respectively.

FIG. 7, in panels A and B, shows an image made from a
region of a DNA chip containing CFTR exon 10 probes; in
panel A, the chip was hybridized to nucleic acid derived
from the genomic DNA of an individual with wild-type
ΔF508 sequences; in panel B, the target nucleic acid originated
from a heterozygous (with respect to the ΔF508
mutation) individual.

FIG. 8, in sheets 1 and 2, corresponding to panels A and
B of FIG. 7, shows graphs of fluorescence intensity versus
tiling position. The labels on the horizontal axis show the
bases in the wild-type sequence corresponding to the position
of substitution in the respective probes. Plotted are the
intensities observed from the features (or synthesis sites)
containing wild-type probes, the features containing the
substitution probes that bound the most target ("called"), and
the feature containing the substitution probes that bound the
target with the second highest intensity of all the substitution
probes ("2nd Highest"). The SEQ ID NOS. corresponding to
two peptide sequences shown in sheet 2 of FIG. 8 are 332
and 318, respectively.

FIG. 9 shows the human mitochondrial genome; "OH" is
the H strand origin of replication, and arrows indicate the
cloned unshaded sequence.

FIG. 10 shows the image observed from application of a
sample of mitochondrial DNA derived nucleic acid (from
the mt4 sample) on a DNA chip.

FIG. 11 is similar to FIG. 10 but shows the image
observed from the mt5 sample.

FIG. 12 shows the predicted difference image between the
mt4 and mt5 samples on the DNA chip based on mismatches
between the two samples and the reference sequence.

FIG. 13 shows the actual difference image observed for
the mt4 and mt5 samples.

FIG. 14, in sheets 1 and 2, shows a plot of normalized
intensities across rows 10 and 11 of the array and a tabulation
of the mutations detected.

FIG. 15 shows the discrimination between wild-type and
mutant hybrids obtained with the chip. A median of the six
normalized hybridization scores for each probe was taken;
the graph plots the ratio of the median score to the normalized
hybridization score versus mean counts. A ratio of 1.6
and mean counts above 50 yield no false positives.

FIG. 16 illustrates how the identity of the base mismatch
may influence the ability to discriminate mutant and wild-type
sequences more than the position of the mismatch
within an oligonucleotide probe. The mismatch position is
expressed as % of probe length from the 3'-end. The base
change is indicated on the graph.

FIG. 17 provides a 5' to 3' sequence listing of one target
corresponding to the probes on the chip. X is a control probe.
Positions that differ in the target (i.e., are mismatched with
the probe at the designated site) are in bold. The SEQ ID.
NO. corresponding to the peptide sequence shown in FIG.
17 is 333.

FIG. 18 shows the fluorescence image produced by scanning
the chip described in FIG. 17 when hybridized to a
sample.

FIG. 19 illustrates the detection of 4 transitions in the
target sequence relative to the wild-type probes on the chip
in FIG. 18.

FIG. 20 shows the alignment of some of the probes on a
p53 DNA chip with a 12-mer model target nucleic acid. The
SEQ ID. NOS. corresponding to the fourteen peptide
sequences shown in FIG. 20 are 334-347, respectively.

FIG. 21 shows a set of 10-mer probes for a p53 exon 6
DNA chip. The SEQ ID. NOS. corresponding to the thirteen
peptide sequences shown in FIG. 21 are 334 and 348-359,
respectively.

FIG. 22 shows that very distinct patterns are observed
after hybridization of p53 DNA chips with targets having
different 1 base substitutions. In the first image in FIG. 22,
the 12-mer probes that form perfect matches with the
wild-type target are in the first row (top). The 12-mer probes
with single base mismatches are located in the second, third,
and fourth rows and have much lower signals.

FIG. 23, in graphs 2, 3, and 4, graphically depicts the data 
in FIG. 22. On each graph, the X ordinate is the position of
the probe in its row on the chip, and the Y ordinate is the 
signal at that probe site after hybridization. 

FIG. 24 shows the results of hybridizing mixed target 
populations of WT and mutant p53 genes to the p53 DNA 
chip. 

FIG. 25, in graphs 1-4, shows (see FIG. 23 as well) the 
hybridization efficiency of a 10-mer probe array as compared
to a 12-mer probe array.

FIG. 26 shows an image of a p53 DNA chip hybridized to 
a target DNA. 

FIG. 27 illustrates how the actual sequence was read from 
the chip shown in FIG. 26. Gaps in the sequence of letters 
in the WT rows correspond to control probes or sites. 
Positions at which bases are miscalled are represented by 
letters in italic type in cells corresponding to probes in which 
the WT bases have been substituted by other bases. The SEQ 
ID. NO. corresponding to the peptide sequence shown in 
FIG. 27 is 360. 

FIG. 28 illustrates the VLSIPS™ technology as applied to 
the light directed synthesis of oligonucleotides. Light (hν) is
shone through a mask (M1) to activate functional groups
(-OH) on a surface by removal of a protecting group (X).
Nucleoside building blocks protected with photoremovable
protecting groups (T-X, C-X) are coupled to the activated
areas. By repeating the irradiation and coupling steps, very 
complex arrays of oligonucleotides can be prepared. 

FIG. 29 illustrates how the VLSIPS™ process can be used
to prepare "nucleoside combinatorials" or oligonucleotides
synthesized by coupling all four nucleosides to form dimers,
trimers, etc.

FIG. 30 shows the deprotection, coupling, and oxidation 
steps of a solid phase DNA synthesis method. 

FIG. 31 shows an illustrative synthesis route for the 
nucleoside building blocks used in the VLSIPS™ method. 

FIG. 32 shows a preferred photoremovable protecting 
group, MeNPOC, and how to prepare the group in active 
form. 

FIG. 33 shows an illustrative detection system for
scanning a DNA chip.

DETAILED DESCRIPTION OF THE 
INVENTION 

Using the VLSIPS™ method, one can synthesize arrays 
of many thousands of oligonucleotide probes on a substrate, 
such as a glass slide or chip. The method can be used, for 
instance, to synthesize "combinatorial" arrays consisting of, 
for example, all possible octanucleotides. Such arrays can be 
used for primary sequencing-by-hybridization on genomic 
DNA fragments or other nucleic acids or to detect mutations 
in a target nucleic acid for which the normal or "wild-type" 
nucleotide sequence is already known. Using the preferred
method of the invention, one employs a strategy called
"tiling" to synthesize specific sets of probes at spatially
defined locations on a substrate, creating the novel probe
arrays and "DNA chips" of the invention.

To illustrate the tiling method of the invention, consider
the problem of detecting mutations at one or more positions
in the nucleotide sequence of a target nucleic acid with
oligonucleotide probes of defined length. The length (L) of
the probe is typically expressed as the number of nucleotides
or bases in a single-stranded nucleic acid probe. For purposes
of the present invention, lengths ranging from 12 to 18
bases are preferred, although shorter and longer lengths can
also be employed. To employ the tiling method, one synthesizes
a set of probes defined by the particular nucleotide
sequence of interest in the target nucleic acid. For each base
in the target DNA segment, one synthesizes a probe complementary
to the subsequence of the target nucleic acid beginning
at that base and ending L-1 bases to the 3'-side (see
FIG. 1).
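
The tiling rule in the preceding paragraph can be sketched in a few lines. This is a minimal illustration, not the procedure used to design any actual chip: the function name and the 17-base target are hypothetical, and the sketch emits only full-length probes, so the last L-1 bases of the target do not start a probe of their own.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tile_wild_type_probes(target_5to3, probe_length):
    """Return one wild-type probe per full-length window of the target.

    The target is read 5'->3'. Each probe is written 3'->5', so it pairs
    base by base with its window and no string reversal is needed.
    """
    probes = []
    for start in range(len(target_5to3) - probe_length + 1):
        # The window begins at base `start` and ends L-1 bases to its 3'-side.
        window = target_5to3[start:start + probe_length]
        probes.append("".join(COMPLEMENT[base] for base in window))
    return probes


if __name__ == "__main__":
    # Hypothetical target chosen for illustration only.
    for probe in tile_wild_type_probes("GGCTGACGTCAGCAATG", 15):
        print("3'-" + probe)
```

Stepping the window one base at a time is what produces a probe for every position of interest, so that each base of the target is interrogated by its own probe.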

In a preferred embodiment of the invention, the probes are
arranged (either by immobilization, typically by covalent
attachment, of a pre-synthesized probe or by synthesis of the
probe on the substrate) on the substrate or chips in separate
lanes stretching across the chip, and these lanes are
in turn arranged in blocks of preferably 5 lanes, although
blocks of other sizes will have useful application, as will be
apparent from the following illustration. The first of these
five lanes, called the "wild-type lane", contains probes
arranged in order of sequence, and all of the probes are
complementary to a specified wild-type nucleic acid
sequence. The other four lanes contain probe sets for detecting
all possible single-base mutations in the defined
sequence; in turn, these probe sets are defined by a position
of potential non-complementarity in the probe relative to the
target (i.e., a single base mismatch) and the identity of the
nucleotide in the probe at that position (i.e., whether the
nucleotide is an A, C, G, or T nucleotide). The position of
mismatch, also called the position of substitution, is preferably
selected to be near the center of the probes, i.e., position
7 of a probe of L=15.
For each probe in the wild-type lane, one synthesizes four
probes (one for each of the lanes other than the wild-type
lane). Three of these four probes are identical to the corresponding
wild-type probe but for the base at the position of
substitution, and the remaining probe is identical to the
wild-type probe. This set of four substitution probes is
preferably placed in a column directly below (or above) the
corresponding wild-type probe, thus creating an A-lane, a
C-lane, a G-lane, and a T-lane. FIG. 2 shows an illustrative
tiled array of the invention with probes for the detection of
point mutations. The base at the position of substitution in
each of the wild-type probes is shown in the wild-type lane,
and the shading shows the location of the substitution probe
having the wild-type sequence. Below are the probes that
would be placed in the column marked by the arrow if the
probe length were 15 and the position of substitution were
7.

3'-CCGACTGCAGTCGTT (SEQ. ID. NO:1)
3'-CCGACTACAGTCGTT (SEQ. ID. NO:2)
3'-CCGACTCCAGTCGTT (SEQ. ID. NO:3)
3'-CCGACTGCAGTCGTT (SEQ. ID. NO:1)
3'-CCGACTTCAGTCGTT (SEQ. ID. NO:4)
Thus, the substitution lanes occupy four of the five lanes
separating successive wild-type lanes on the chip; the blocks
of five lanes can be separated by a sixth lane for measurement
of background signals.
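
The five probes listed above follow mechanically from the wild-type probe. The sketch below reproduces them; the function name `column_set` and the dictionary layout are hypothetical, and the position of substitution is assumed to be counted from 1 at the 3'-end, matching the way the probes are written above.

```python
BASES = "ACGT"


def column_set(wt_probe_3to5, substitution_position=7):
    """Return the wild-type probe plus its four substitution probes.

    The probe is written 3'->5' and the substitution position is counted
    from 1 at the 3'-end (an assumption matching the listing above).
    """
    i = substitution_position - 1
    lanes = {"WT": wt_probe_3to5}
    for base in BASES:
        # Replace only the base at the position of substitution.
        lanes[base] = wt_probe_3to5[:i] + base + wt_probe_3to5[i + 1:]
    return lanes


if __name__ == "__main__":
    for lane, probe in column_set("CCGACTGCAGTCGTT").items():
        print(lane, "3'-" + probe)
```

Because the wild-type base at position 7 is itself one of A, C, G, or T, exactly one substitution probe always duplicates the wild-type probe, here the G-lane probe.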

The DNA chips of the invention have a wide variety of 
applications. In one embodiment, the DNA chip is used to 
select an optimal probe from an array of probes. In this 
embodiment, an array of probes of variable length and 
65 sequences is synthesized and then hybridized to a target 
nucleic acid of known sequence. The pattern of hybridiza- 
tion reveals the optimal length and sequence composition of 
probes to detect a particular mutation or other specific
sequence of nucleotides. In some circumstances, i.e., target
nucleic acids with repeated sequences or with high G/C
content, very long probes may be required for optimal
detection. In one embodiment for detecting specific
sequences in a target nucleic acid with a DNA chip, repeat
sequences are detected as follows. The chip comprises
probes of length sufficient to extend into the repeat region
varying distances from each end. The sample, prior to
hybridization, is treated with a labeled oligonucleotide that
is complementary to a repeat region but shorter than the full
length of the repeat. The target nucleic acid is labeled with a
second, distinct label. After hybridization, the chip is
scanned for probes that have bound both the labeled target
and the labeled oligonucleotide probe; the presence of such
bound probes shows that at least two repeat sequences are
present.

A variety of methods can be used to enhance detection of
labeled targets bound to a probe on the array. In one
embodiment, the protein MutS (from E. coli) or equivalent
proteins such as yeast MSH1, MSH2, and MSH3; mouse
Rep-3, and Streptococcus Hex-A, is used in conjunction
with target hybridization to detect probe-target complexes that
contain mismatched base pairs. The protein, labeled directly
or indirectly, can be added to the chip during or after
hybridization of target nucleic acid, and differentially binds
to homo- and heteroduplex nucleic acid. A wide variety of
dyes and other labels can be used for similar purposes. For
instance, the dye YOYO-1 is known to bind preferentially to
nucleic acids containing sequences comprising runs of 3 or
more G residues.

The DNA chips produced by the methods of the invention
can be used to study and detect mutations in exons of human
genes of clinical interest, including point mutations and
deletions. In the following sections, the method of the
invention is illustrated by the detection of mutations in a
variety of clinically and medically significant human nucleic
acid sequences. Thus, the invention is illustrated first with
respect to the preparation of DNA chips for the detection of
mutations associated with cystic fibrosis, then with DNA
chips for the detection of human mitochondrial DNA
sequences, then with DNA chips for the detection of mutations
in the human p53 gene associated with cancer, and
finally with respect to the detection of mutations in the HIV
RT gene associated with drug resistance.

Detection of Cystic Fibrosis Mutations with DNA Chips

A number of years ago, cystic fibrosis, the most common
severe autosomal recessive disorder in humans, was shown
to be associated with mutations in a gene thereafter named
the Cystic Fibrosis Transmembrane Conductance Regulator
(CFTR) gene. The sequences of the exons and parts of the
introns in the gene are known, as are the changes corresponding
to several hundred known mutations. Several tests
have been developed for detecting the most frequent of these
mutations. The present invention provides CFTR gene oligonucleotide
arrays (DNA chips) that can be used to identify
base substitution and any deletion within the 192-base exon,
including the three-base deletion known as ΔF508. As
described in detail below, hybridization of sub-nanomolar
concentrations of wild-type and ΔF508 oligonucleotide target
nucleic acids labeled with fluorescein to these arrays
produces highly specific signals (detected with confocal
scanning fluorescence microscopy) that permit discrimination
between mutant and wild-type target sequences in both
homozygous and heterozygous cases. The method and chips
of the invention can also be used to detect other known
mutations in the CFTR gene, as described in detail below.

The most common cystic fibrosis mutation is known as
ΔF508, because the mutation is a three-base deletion that
results in the removal of amino acid #508 from the CFTR
protein. The present invention provides DNA chips for
detecting ΔF508; one such chip results from applying the
tiling method to exon 10 of the CFTR gene, the exon to
which ΔF508 has been mapped. The tiling method involved
the synthesis of a set of probes of a selected length in the
range of from 10 to 18 bases and complementary to subsequences
of the known wild-type CFTR sequence starting at
a position a few bases into the intron on the 5'-side of exon
10 and ending a few bases into the intron on the 3'-side.
There was a probe for each possible subsequence of the
given segment of the gene, and the probes were organized
into a "lane" in such a way that traversing the lane from the
upper left-hand corner of the chip to the lower right-hand
corner corresponded to traversing the gene segment base-
by-base from the 5'-end. The lane containing that set of
probes is, as noted above, called the "wild-type lane."

Relative to the wild-type lane, a "substitution" lane, called
the "A-lane", was synthesized on the chip. The A-lane
probes were identical in sequence to an adjacent
(immediately below the corresponding) wild-type probe but
contained, regardless of the sequence of the wild-type probe,
a dA residue at position 7 (counting from the 3'-end). In
similar fashion, substitution lanes with replacement bases
dC, dG, and dT were placed onto the chip in a "C-lane," a
"G-lane," and a "T-lane," respectively. A sixth lane on the
chip consisted of probes identical to those in the wild-type
lane but for the deletion of the base in position 7 and
restoration of the original probe length by addition, at the
5'-end, of the base complementary to the gene at that position.
The four substitution lanes enable one to deduce the
sequence of a target exon 10 nucleic acid from the relative
intensities with which the target hybridizes to the probes in
the various lanes. The probe organization on the chip can be
conveniently columnar, and the set of probes consisting of a
wild-type probe and four corresponding substitution probes
is referred to as a "column set." One and only one of the four
substitution probes in a column set has exactly the same
sequence as the wild-type probe in the set. Those of skill in
the art will appreciate that, in other embodiments of the
invention, one could delete one or more lanes or columns
and still benefit from the invention. Various versions of such
exon 10 DNA chips were made as described above with
probes 15 bases long, as well as chips with probes 10, 14,
mutations in the CFTR gene rapidly and efficiently. and 18 bases long. For the results described below, the 
The methods used to make the high-density DNA chips of probes were 15 bases long, and the position of substitution 
me invention allow probes for long stretches of DNA coding was 7 from the 3*-end. 

regions to be directly "written" onto the chips in the form of 60 To demonstrate the ability of the chip to distinguish the 

sets of overlapping oligonucleotides. These methods have AF508 mutation from the wild-type, two synthetic target 

been used to develop a number of useful CFTR gene chips, nucleic acids were made. The first, a 39-mer complementary 

one illustrative chip bears an array of 1296 probes covering to a subsequence of exon 10 of the CFTR gene having the 

the full length of exon 10 of the CFTR gene arranged in a three bases involved in the AF508 mutation near its center, 

36x36 array of 356 Xm elements. The probes in the array can 65 is called the "wild-type" or wt508 target, corresponds to 

have any length, preferably in the range of from 10 to 18 positions 111-149 of the exon, and has the sequence shown 

residues and can be used to detect and sequence any single- below: 
5'-CATTAAAGAAAATATCATCTTTGGTGTTTCCTATGATGA (SEQ. ID NO: 5).

The second, a 36-mer probe derived from the wild-type target by removing those same three bases,
is called the "mutant" target or mu508 target and has the sequence shown below, first with dashes
to indicate the deleted bases, and then without dashes but with one base underlined (to indicate
the base detected by the T-lane probe, as discussed below):

5'-CATTAAAGAAAATATCAT---TGGTGTTTCCTATGATGA; (SEQ. ID NO:6)

5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA. (SEQ. ID NO: 7)

Both targets were labeled with fluorescein at the 5'-end.

In three separate experiments, the wild-type target, the mutant target, and an equimolar mixture
of both targets were exposed (0.1 nM wt508, 0.1 nM mu508, and 0.1 nM wt508 plus 0.1 nM mu508,
respectively, in a solution compatible with nucleic acid hybridization) to a CF chip. The
hybridization mixture was incubated overnight at room temperature, and then the chip was scanned
on a reader (a confocal fluorescence microscope in photon-counting mode; images of the chip were
constructed from the photon counts) at several successively higher temperatures while still in
contact with the target solution. After each temperature change, the chip was allowed to
equilibrate for approximately one-half hour before being scanned. After each set of scans, the
chip was exposed to denaturing solvent and conditions to wash, i.e., remove target that had
bound, the chip so that the next experiment could be done with a clean chip.

The results of the experiments are shown in FIGS. 3, 4, 5, and 6. FIG. 3, in panels A, B, and C,
shows an image made from the region of a DNA chip containing CFTR exon 10 probes; in panel A, the
chip was hybridized to a wild-type target; in panel C, the chip was hybridized to a mutant delta
508 target; and in panel B, the chip was hybridized to a mixture of the wild-type and mutant
targets. FIG. 4, in sheets 1-3, corresponding to panels A, B, and C of FIG. 3, shows graphs of
fluorescence intensity versus tiling position. The labels on the horizontal axis show the bases
in the wild-type sequence corresponding to the position of substitution in the respective probes.
Plotted are the intensities observed from the features (or synthesis sites) containing wild-type
probes, the features containing the substitution probes that bound the most target ("called"),
and the feature containing the substitution probes that bound the target with the second highest
intensity of all the substitution probes ("2nd Highest").

These figures show that, for the wild-type target and the equimolar mixture of targets, the
substitution probe with a nucleotide sequence identical to the corresponding wild-type probe
bound the most target, allowing for an unambiguous assignment of target sequence as shown by
letters near the points on the curve. The target wt508 thus hybridized to the probes in the
wild-type lane of the chip, although the strength of the hybridization varied from
probe-to-probe, probably due to differences in melting temperature. The sequence of most of the
target can thus be read directly from the chip, by inference from the pattern of hybridization in
the lanes of substitution probes (if the target hybridizes most intensely to the probe in the
A-lane, then one infers that the target has a T in the position of substitution, and so on).

For the mutant target, the sequence could similarly be called on the 3'-side of the deletion.
However, the intensity of binding declined precipitously as the point of substitution approached
the site of the deletion from the 3'-end of the target, so that the binding intensity on the
wild-type probe whose point of substitution corresponds to the T at the 3'-end of the deletion
was very close to background. Following that pattern, the wild-type probe whose point of
substitution corresponds to the middle base (also a T) of the deletion bound still less target.
However, the probe in the T-lane of that column set bound the target very well.

Examination of the sequences of the two targets reveals that the deletion places an A at that
position when the sequences are aligned at their 3'-ends and that the T-lane probe is
complementary to the mutant target with but two mismatches near an end (shown below in lower-case
letters, with the position of substitution underlined):

Target: 5'-CATTAAAGAAAATATCATTGGTGTTTCCTATGATGA
Probe:  3'-          TagTAGTAACCACAA (SEQ. ID NO:8)

Thus the T-lane probe in that column set calls the correct base from the mutant sequence. Note
that, in the graph for the equimolar mixture of the two targets, that T-lane probe binds almost
as much target as does the A-lane probe in the same column set, whereas in the other column sets,
the probes that do not have wild-type sequence do not bind target nearly as well. Thus, that one
column set, and in particular the T-lane probe within that set, detects the ΔF508 mutation under
conditions that simulate the homozygous case and also conditions that simulate the heterozygous
case.

The present invention thus provides individual probes, sets of probes, and arrays of probe sets
on chips, in specific patterns, as the probes provide important benefits for detecting the
presence of specific exon 10 sequences. The sequences of several important probes of the
invention are shown below. In each case, the letter "X" stands for the point of substitution in a
given column set, so each of the sequences actually represents four probes, with A, C, G, and T,
respectively, taking the place of the "X." Sets of shorter probes derived from the sets shown
below by removing up to five bases from the 5'-end of each probe, and sets of longer probes made
from this set by adding up to three bases from the exon 10 sequence to the 5'-end of each probe,
are also useful and provided by the invention.

3'-TTTATAXTAGAAACC (SEQ. ID NO:9)
3'-TTATAGXAGAAACCA (SEQ. ID NO:10)
3'-TATAGTXGAAACCAC (SEQ. ID NO:11)
3'-ATAGTAXAAACCACA (SEQ. ID NO:12)
3'-TAGTAGXAACCACAA (SEQ. ID NO:13)
3'-AGTAGAXACCACAAA (SEQ. ID NO:14)
3'-GTAGAAXCCACAAAG (SEQ. ID NO:15)
3'-TAGAAAXCACAAAGG (SEQ. ID NO:16)
3'-AGAAACXACAAAGGA (SEQ. ID NO:17)

Although in this example the sequence could not be reliably deduced near the ends of the target,
where there is not enough overlap between target and probe to allow effective hybridization, and
around the center of the target, where hybridization was weak for some other reason, perhaps high
AT-content, the results show the method and the probes of the invention can be used to detect the
mutation of interest. The mutant target gave a pattern of hybridization that was very similar to
that of the wt508 target at the ends, where the two share a common sequence, and very different
in the middle, where the deletion is located. As one scans the image from right to left, the
intensity of hybridization of the target to the probes in the wild-type lane drops off much more
rapidly near the center of the image for mu508 than for wt508; in addition, there is one probe in
the T-lane that hybridizes intensely with mu508 and hardly at all with wt508. The results from
the equimolar mixture of the two targets, which represents the case one would encounter in
testing a heterozygous individual for the mutation, are a
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blend of the results for the separate targets, showing the 
power of the invention to distinguish a wild-type target 
sequence from one containing the ΔF508 mutation and to
detect a mixture of the two sequences. 
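
The tiling rule behind the exon 10 probe sets listed above (SEQ. ID NOS: 9-17) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used to design the chips: the function names are hypothetical, and the 23-base target segment is taken from the wt508 sequence given above on the assumption that it is the region those nine probes cover.

```python
# Illustrative sketch: build "column sets" (one wild-type probe plus four
# substitution probes) for 15-mers tiled one base at a time, with the
# substitution at position 7 counted from the 3'-end.
# All names here are hypothetical; only the sequences come from the text.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def probe_strand(target_5to3):
    """Complement each base; the result reads 3'->5', aligned under the target."""
    return "".join(COMPLEMENT[b] for b in target_5to3)


def column_sets(strand_3to5, probe_len=15, sub_pos=7):
    """Return one column set per tiling position along the probe strand."""
    idx = sub_pos - 1  # 0-based index of the substitution position
    sets = []
    for start in range(len(strand_3to5) - probe_len + 1):
        wild = strand_3to5[start:start + probe_len]
        lanes = {b: wild[:idx] + b + wild[idx + 1:] for b in BASES}
        template = wild[:idx] + "X" + wild[idx + 1:]
        sets.append({"wild_type": wild, "template": template, "lanes": lanes})
    return sets


# Segment of the wt508 target (5'->3') assumed to underlie SEQ. ID NOS: 9-17.
TARGET_SEGMENT = "AAATATCATCTTTGGTGTTTCCT"

sets = column_sets(probe_strand(TARGET_SEGMENT))
for s in sets:
    print("3'-" + s["template"])
```

In each column set exactly one of the four substitution probes coincides with the wild-type probe, and the T-lane probe of the fifth set reproduces the probe shown above as SEQ. ID NO:8.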

The results above clearly demonstrate how the DNA chips
of the invention can be used to detect a deletion mutation,
ΔF508; another model system was used to show that the
chips can also be used to detect a point mutation. One
of the more frequent mutations in the CFTR gene is G480C,
which involves the replacement of the G in position 46 of
exon 10 by a T, resulting in the substitution of a cysteine for
the glycine normally in position #480 of the CFTR protein.
The model target sequences included the 21-mer probe
wt480 to represent the wild-type sequence at positions
37-55 of exon 10: 5'-CCTTCAGAGGGTAAAATTAAG
(SEQ. ID NO:18) and the 21-mer probe mu480 to represent
the mutant sequence: 5'-CCTTCAGAGTGTAAAATTAAG
(SEQ. ID NO:19).
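
As a quick check of the two model targets, the following Python sketch locates the single base that differs between wt480 and mu480 and reads the affected codon. It is a hedged illustration only: the helper name and the two-entry codon table are hypothetical, the wild-type sequence is taken to end in ...AAAATTAAG as the mutant does, and the reading frame (changed base first in its codon) is an assumption chosen because it reproduces the glycine-to-cysteine change described above.

```python
# Illustrative sketch: find the point difference between the two 21-mers.
WT480 = "CCTTCAGAGGGTAAAATTAAG"
MU480 = "CCTTCAGAGTGTAAAATTAAG"


def point_differences(a, b):
    """Return (index, base_in_a, base_in_b) for every mismatching position."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Only the two codons needed here; this is not a full genetic code table.
CODONS = {"GGT": "Gly", "TGT": "Cys"}

diffs = point_differences(WT480, MU480)
i = diffs[0][0]
# Assumed reading frame: the changed base is the first base of its codon.
wt_codon, mu_codon = WT480[i:i + 3], MU480[i:i + 3]
print(diffs, CODONS[wt_codon], "->", CODONS[mu_codon])
```

With the 21-mer starting at position 37 of exon 10, the 0-based index found here corresponds to position 46, the position named in the text.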
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terns. The wild-type sequence could easily be read from the 
chip, but the probe that bound the mu480 target so well when 
only the mu480 target was present also bound it well when 
both the mutant and wild-type targets were present in a 
mixture, making the hybridization pattern easily distinguishable
from that of the wild-type target alone. These results
again show the power of the DNA chips of the invention to
detect point mutations in both homo- and heterozygous
individuals.

To demonstrate clinical application of the DNA chips of
the invention, the chips were used to study and detect
mutations in nucleic acids from genomic samples. Genomic
samples from an individual carrying only the wild-type gene
and an individual heterozygous for ΔF508 were amplified by
PCR using exon 10 primers containing the promoter for T7
RNA polymerase. Illustrative primers of the invention are
shown below.



Exon    Name       Sequence

10      CFi9-T7    TAATACGACTCACTATAGGGAGatgacctMtaatgatgggm (SEQ. ID. NO:20)
10      CFi10c-T7  TAATACGACTCACTATAGGGAGtagtgtgaagggttcatatgq (SEQ. ID. NO:21)
10      CFi10c-T3  CTtXK>AATTAACCCTCACTAAAGGtagtgtgaagggttcata (SEQ. ID. NO:22)
10, 11  CFi10-T7   TAATACGACTCACTATAGGGAGagcatactaaaagtgactctc (SEQ. ID. NO:23)
11      CFi11c-T7  TAATACGACTCACTATAGGGAGacatgaatgacatttacagcaa (SEQ. ID. NO:24)
11      CFi11c-T3  CGGAATTAACCCTCACTAAAGGacatgaatgacatttacagcaa (SEQ. ID. NO:25)



In separate experiments, a DNA chip was hybridized to
each of the targets wt480 and mu480, respectively, and then
scanned with a confocal microscope. FIG. 5, in panels A, B,
and C, shows an image made from the region of a DNA chip
containing CFTR exon 10 probes; in panel A, the chip was
hybridized to the wt480 target; in panel C, the chip was
hybridized to the mu480 target; and in panel B, the chip was
hybridized to a mixture of the wild-type and mutant targets.
FIG. 6, in sheets 1-3, corresponding to panels A, B, and C
of FIG. 5, shows graphs of fluorescence intensity versus
tiling position. The labels on the horizontal axis show the
bases in the wild-type sequence corresponding to the position
of substitution in the respective probes. Plotted are the
intensities observed from the features (or synthesis sites)
containing wild-type probes, the features containing the
substitution probes that bound the most target ("called"), and
the feature containing the substitution probes that bound the
target with the second highest intensity of all the substitution
probes ("2nd Highest").

These figures show that the chip could be used to
sequence a 16-base stretch from the center of the target
wt480 and that discrimination against mismatches is quite
good throughout the sequenced region. When the DNA chip
was exposed to the target mu480, only one probe in the
portion of the chip shown bound the target well: the probe
in the set of probes devoted to identifying the base at
position 46 in exon 10 and that has an A in the position of
substitution and so is fully complementary to the central
portion of the mutant target. All other probes in that region
of the chip have at least one mismatch with the mutant target
and therefore bind much less of it. In spite of that fact, the
sequence of mu480 for several positions to both sides of the
mutation can be read from the chip, albeit with much-reduced
intensities from those observed with the wild-type
target.

The results also show that, when the two targets were
mixed together and exposed to the chip, the hybridization 
pattern observed was a combination of the other two pat- 



These primers can be used to amplify exon 10 or exon 11
sequences; in another embodiment, multiplex PCR is
employed, using two or more pairs of primers to amplify
more than one exon at a time.

The product of amplification was then used as a template
for the RNA polymerase, with fluoresceinated UTP present
to label the RNA product. After sufficient RNA was made,
it was fragmented and applied to an exon 10 DNA chip for
15 minutes, after which the chip was washed with hybridization
buffer and scanned with the fluorescence microscope.
A useful positive control included on many CF exon
10 chips is the 8-mer 3'-CGCCGCCG-5'. FIG. 7, in panels
A and B, shows an image made from a region of a DNA chip
containing CFTR exon 10 probes; in panel A, the chip was
hybridized to nucleic acid derived from the genomic DNA of
an individual with wild-type ΔF508 sequences; in panel B,
the target nucleic acid originated from a heterozygous (with
respect to the ΔF508 mutation) individual. FIG. 8, in sheets
1 and 2, corresponding to panels A and B of FIG. 7, shows
graphs of fluorescence intensity versus tiling position.

These figures show that the sequence of the wild-type
RNA can be called for most of the bases near the mutation.
In the case of the ΔF508 heterozygous carrier, one particular
probe in the T-lane, the same one that distinguished so clearly
between the wild-type and mutant oligonucleotide targets in the
model system described above, binds a large
amount of RNA, while the same probe binds little RNA from
the wild-type individual. These results show that the DNA
chips of the invention are capable of detecting the ΔF508
mutation in a heterozygous carrier.

Thus, the present invention provides methods for synthe- 
sizing large numbers of oligonucleotide probes on a glass 
substrate and unique probe sets in a defined array in which 
the probes are arranged in the array by the "tiling" method 
of the invention. The DNA chips produced by the method
can be used to detect mutations in particular sequences of a 
target nucleic acid, such as genomic DNA or RNA produced 
from transcription of an amplified genomic DNA. These 
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chips can be used to detect both point mutations and small 
deletions. Moreover, the pattern of hybridization to the chip 
allows inferences to be drawn about the sequences of the 
mutant DNAs. 

For example, in the model system involving the cystic
fibrosis point mutation G480C, the A-lane probe whose
position of substitution corresponds to the position of the
mutation does not bind much wild-type target, because in the
wild-type sequence, a G occupies that position. However, it
binds mutant target very well, allowing one to infer correctly
that the mutation involves a change of that G to a T.
Similarly, in the case of the three-base deletion in cystic
fibrosis known as ΔF508, the T-lane probe that binds mutant
target so intensely is responding to the fact that the deletion
has brought a CAT sequence into the position occupied by
a CTT sequence in the wild-type target. The DNA chips of
the invention can be used to detect and sequence not only
known mutations in an organism's genome but also new
mutations not previously characterized. The DNA chips and
methods of the invention can also be used to detect specific
sequences in other CFTR exons as well as other human
genes for purposes of research and clinical genetic analysis,
as demonstrated below.
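
The inference described above (the brightest substitution lane in a column set names the complement of the target base) can be written as a short Python sketch. This is an illustration under stated assumptions rather than the analysis actually used for the figures: the function name is hypothetical, the intensity values are invented for the example, and the ratio of the highest to the second-highest lane is used here only as a simple stand-in for the "called" versus "2nd Highest" comparison.

```python
# Illustrative sketch: call the target base for one column set from the
# intensities of its four substitution lanes.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(lane_intensities):
    """Return (called target base, ratio of best to second-best lane)."""
    ranked = sorted(lane_intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_lane, best), (_, second) = ranked[0], ranked[1]
    ratio = best / second if second > 0 else float("inf")
    # The probe base in the brightest lane pairs with the target base.
    return COMPLEMENT[best_lane], ratio


# Hypothetical photon counts for one column set (not data from the figures).
example = {"A": 820.0, "C": 95.0, "G": 110.0, "T": 70.0}
base, ratio = call_base(example)
print(base, round(ratio, 2))
```

A strong A-lane signal therefore yields a call of T in the target, which is the reasoning applied to the G480C mutation above.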

Detection of Specific Human Mitochondrial DNA 
Sequences with DNA Chips 

As noted above, the present invention provides DNA 
chips on which a known DNA sequence is represented as an 
array of overlapping oligonucleotides on a solid support. 
This set of oligonucleotides is used to probe a target nucleic 
acid comprising the known sequence, allowing mutations to
be detected. As also noted above, there are advantages in 
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some applications to using a minimal set of oligonucleotides 
specific to the sequence of interest, rather than a set of all 
possible N-mers. Some of these advantages include: (i) each 
position in the array is highly informative, whether or not 
hybridization occurs; (ii) nonspecific hybridization is mini- 
mized; (iii) it is straightforward to correlate hybridization 
differences with sequence differences, particularly with ref- 
erence to the hybridization pattern of a known standard; and 
(iv) the ability to address each probe independently during 
synthesis, using high resolution photolithography, allows the 
array to be designed and optimized for any sequence. For
example, the length of any probe can be varied independently
of the others.
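
To give a sense of scale for the comparison between a sequence-specific probe set and a set of all possible N-mers, the following Python sketch counts both. It is a rough illustration only: the function names are hypothetical, and a fixed step between neighboring probes is assumed for simplicity, whereas the chips described here may use varying overlaps.

```python
# Illustrative sketch: size of a complete N-mer set versus a tiled set.
import math


def all_nmer_count(n):
    """Number of distinct oligonucleotides of length n over A, C, G, T."""
    return 4 ** n


def tiled_probe_count(target_len, probe_len, step=1):
    """Probes needed to tile a target when neighbors are offset by `step`."""
    if probe_len > target_len:
        raise ValueError("probe longer than target")
    return math.ceil((target_len - probe_len) / step) + 1


print(all_nmer_count(12))          # every possible 12-mer
print(tiled_probe_count(23, 15))   # 15-mers stepped one base along 23 bases
```

The second call corresponds to the nine exon 10 probes listed earlier, which step one base at a time along a 23-base segment.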

The present invention illustrates these advantages by
providing DNA chips and analytical methods for detecting
specific sequences of human mitochondrial DNA. In one
preferred embodiment, the invention provides a DNA chip
for analyzing sequences contained in a 1.3 kb fragment of
human mitochondrial DNA from the "D-loop" region, the
most polymorphic region of human mitochondrial DNA.
One such chip comprises a set of 269 overlapping oligonucleotide
probes of varying length in the range of 9-14
nucleotides with varying overlaps arranged in ~600x600
micron features or synthesis sites in an array 1 cm x 1 cm in
size. The probes on the chip are shown in columnar form
below. An illustrative mitochondrial DNA chip of the invention
comprises the following probes (X, Y coordinates are
shown, followed by the sequence; "DL3" represents the
3'-end of the probe, which is covalently attached to the chip
surface.)



0 0 D L3 AGTGOGGTATTT 

1 0 DL3GGOTATTTAGTT 

2 0 DL3TTAGTTTATCCAA 

3 0 D L3 ATCCAAACCAGG 

4 0 D L3 ACCAGG ATCGG A 

5 0 D L3 CGTGTGTGTGTGG 

6 0 D L3 CGTGTGTGTGTGGC 

7 0 D L3TCGTGTGTGTGTGG 

8 0 DL3GTAGGATGGGTC 

9 0 DL3AGGATGGGTCGT 

10 0 DL3GATGGGTCGTGT 

11 0 DL3TGGCGACGATTG 

12 0 D L3 GCG ACG ATTGGG 

13 0 DUTGGGGGGGA 

14 0 DL3GAGGGGGOG 

15 0 DL3GGAGGGGGCGA 

16 0 DL3GAGGGGGOGA 

0 1 DL3GGCrTGGTTGG 

1 1 DL3GGTrGGTTTGGG 

2 1 DL3TGGGGTTTCTAG 

3 1 D L3GTTTCTAGTGGG 

4 1 DUAGTGGGGGGTGT 

5 1 DL3GGGGTGTCAAAT 

6 1 D1JGTCAAATACATCG 

7 1 DL3ACATCGAATGGAG 

8 1 DL3CGAATGGAGGAG 

9 1 DUGAGGAGTTTCGT 

10 1 DL3TTTCGTTATGTGA 

11 1 DUATGTGACITTTAC 

12 1 DL3GACTTTTACAAAT 

13 1 DL3AAATCTGCCCGA 

14 1 DL3AATCTGCOCGAG 

15 1 DL3CCCGAGTGTAGT 

16 1 D L3 AGTGTAGTGGGG 

0 2 D L3 GGG AGGGTG AG 

1 2 D L3 GGTG AGGGTATG 

2 2 D L3GGTATG ATG ATTAG 

3 2 D L3G ATTAG AGTAAGT 

4 2 DL3TTAGAGTAAGTTA 



(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ n>. 
(SEQ ID. 
(SEQ ID. 



(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 

(SEQ n>. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ n>. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 
(SEQ ID. 



NO:26) 
N027) 
NO:28) 
NO:29) 
NOJ0) 
NOJ1) 
'N032) 
NO:33) 
NO;34) 
NO:35) 
NO:36) 
NO:37) 
NOJ8) 



N039) 
NO:40) 
NO:41) 
NO:42) 
N0:43) 
NO:44) 
NO:45) 
NO:46) 
NO.47) 
NO:48) 
NO:49) 
NO-JO) 
NOJ1) 
NOJ2) 
NOJ3) 
NOJ4) 
NOJ5) 
NOJ6) 
NOJ7) 
NOJ8) 
NOJ9) 
NO:60) 
NO:61) 
NO:62) 



9 2 DL3GGTAGGATGGGT (SEQ ID. NO:67)
10 2 DL3GGATGGGTCGTG (SEQ ID. NO:68)
11 2 DL3GGTCGTGTGTGT (SEQ ID. NO:69)
12 2 DL3GTGTGTGTGGCG (SEQ ID. NO:70)
13 2 DL3TGTGGCGACGAT (SEQ ID. NO:71)
14 2 DL3GACGATTGGGGT (SEQ ID. NO:72)
15 2 DL3ATTGGGGTATGG (SEQ ID. NO:73)
16 2 DL3GTATGGGGCTTG (SEQ ID. NO:74)
0 3 DL3GGATTGTGGTCG (SEQ ID. NO:75)
1 3 DL3TGGTCGGATTGG (SEQ ID. NO:76)
2 3 DL3GGATTGGTCTAAA (SEQ ID. NO:77)
3 3 DL3TCTAAAGTTTAAA (SEQ ID. NO:78)
4 3 DL3GTTTAAAATAGAA (SEQ ID. NO:79)
5 3 DL3ATAGAAAAACCG (SEQ ID. NO:80)
6 3 DL3AGAAAAACCGC (SEQ ID. NO:81)
7 3 DL3AACCGCCATAC (SEQ ID. NO:82)
8 3 DL3CCATACGTGAAAA (SEQ ID. NO:83)
9 3 DL3ACGTGAAAATTGT (SEQ ID. NO:84)
10 3 DL3AATTGTCAGTGGG (SEQ ID. NO:85)
11 3 DL3TGTCAGTGGGGG (SEQ ID. NO:86)
12 3 DL3TGGGGTTGA (SEQ ID. NO:87)
13 3 DL3GGGTTGATTGTGT (SEQ ID. NO:88)
14 3 DL3TTGTGTAATAAAA (SEQ ID. NO:89)
15 3 DL3AATAAAAGGGGA (SEQ ID. NO:90)
16 3 DL3TAAAAGGGGAGG (SEQ ID. NO:91)
0 4 DL3GTTTTTTAAAGG (SEQ ID. NO:92)
1 4 DL3TTTTAAAGGTGG (SEQ ID. NO:93)
2 4 DL3AGGTGGTTTGG (SEQ ID. NO:94)
3 4 DL3TTGGGGGGGAG (SEQ ID. NO:95)
4 4 DL3GGAGGGGGCG (SEQ ID. NO:96)
5 4 DL3GGGGCGAAGAC (SEQ ID. NO:97)
6 4 DL3GAAGACCGGATG (SEQ ID. NO:98)
7 4 DL3CCGGATGTCGTG (SEQ ID. NO:99)
8 4 DL3GTCGTGAATTTGT (SEQ ID. NO:100)
9 4 DL3CGTGAATTTGTGT (SEQ ID. NO:101)
10 4 DL3TTGTGTAGAGACG (SEQ ID. NO:102)
11 4 DL3TAGAGACGGTTT (SEQ ID. NO:103)
12 4 DL3ACGGTTTGGGG (SEQ ID. NO:104)
13 4 DL3TGGGGTTTTTGT (SEQ ID. NO:105)
14 4 DL3GGGTTTTTGTTT (SEQ ID. NO:106)
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DIJAAGTTATGTTGGG 

DL3GTTGGGGGCG 

DL3GGGGGGGGTA 

D L3 GCGGGTAGO AT 

D L3ACACAATTAATTAA 

DUAATTAATTACGAA 

DL3TACGAACATCCTG 

DL3ACGAACATCCTGT 

D L3TCCTGTA1TATTA 

DL3GTATTATTATTGTT 

DL3ATTGTTAAACTTA 

DL3AAACITACAGACG 

DL3ACAGACGTGTCO 

D L3 GTGTCGGTG AAA 

D L3 GTG AAAGGTGTGT 

D L2 GGTGTGTCTGTAG 

DL3TGTGTCTG TAGTA 

P L3 GTAGTATT GTTTT 

DL3AGTATTGTTTTTT 

D L3 CCTCGTGGG ATA 

DL3TGGGATACAGCG 

DL3GATACAGCGTCAT 

DL3GCGTCATAGACAG 

D U AG ACAG AAACTAA 

DL3CAGAAACTAAGGA 

DL3TAAGGACGGAGT 

DL3GA0GGAGTAGGA 

DL3GTAGGATAATAAA 

DL3TAATAAATAGCG 

DUATAGCGTAGGAT 

DL3TAGCGTAGGATG 

DL3AGGATGCAAGTT 

DL3ATGCAAGTTATAA 

DL3GTTATAATGTCCG 

DL3ATGTCCGCITGT 

DUTCCGCTTGTATG 

DL3GTGAGTGCCCTC 

D L3TGCCXTTOG AG AG 

DL3CCTCGAGAGGTA 

DL3AGAGGTACGTAA 

DL3ACGTAAACCATA 

DL3ACCATAAAAGCAG 

DL3AAAGCAGACCC 

DL3AGACCCCCCAT 

D L3 CCCCC ATACGT 

DL3CATACGTGCGCT 

DL3 GTGCGCTATCAG 

D L3GOGCTATCAGTA 

DUTCAGTAACGCTC 

D L3 GTAACGCrCTGC 

DL3AGTCTATCCCCA 

DUATCOCCAGGGA 

DL3CAGGOAACTGGT 

DUACTGGTGGTAGG 

DL3CTGGTGGTAGGA 

DL3GTAGGAGGCACA . 

DL3GGCACATTTAGT 

DL3TTTAGTTATAGGG 

DL3AGGTTTACGGTG 

D L3TACGGTGGGG A 

DL3GTGGGGAGTSG 

DUGGGAGTGGCTGA 

D L3 GGGTG ATCCTATG 

DIJCCXATGGTTGTTT 

DL3GGTTGTTTGGATG 

DL3GTTTGGATGGGT 

DUATGGGTGGGAAT 

DL3GGGAATTGTCATG 

DUGTCATGTATCATGT 

DUTCATGTATTTCGG 

DL3TATTTCGGTAAA 

DL3TTCGGTAAATGG 

DL3GTAAATGGCATGT 

DUGCATGTAATCGTG 

D L3 GTAATCGTGTAAT 

DL3GGGAGGGGTAC 

D L3 GGGTACG AATGT 

DL3ACGAATGTTCGTT 

DL3TGTTCGTTCATGT 

DL3CGTTCATGTCGTT 



(SEQ ID. NO:63) 15 4 

(SEQ ID. NO:64) 16 4 

(SEQ ID. NO:65) 0 5 

(SEQ ID. NO:66) 1 5 

(SEQ ID. NO:lll) 14 7 

(SEQ. ID. NO:112) 15 7 

(SEQ ID. N0:113) 16 7 

(SEQ ID. NO:114) 0 8 

(SEQ ID. N0:115) 1 8 

(SEQ ID. NO:116) 2 8 

(SEQ ID. NO:117) 3 8 

(SEQ ID. NO:118) 4 8 

(SEQ ID. NO:119) 5 8 

(SEQ ID. NO:120) 6 8 

(SEQ ID. NO:121) 7 8 

(SEQ ID. NO:122) 8 8 

(SEQ ID. NO:123) 9 8 

(SEQ ID. NO:124) 10 8 

(SEQ ID. NO:125) 11 8 

(SEQ ID. N0:126) 12 8 

(SEQ ID. NO:127) 13 8 

(SEQ ID. NO: 128) 14 8 

(SEQ ID. NO:129) 15 8 

(SEQ ID. NO:130) 16 8 

(SEQ ID. NO:131) 0 9 

(SEQ ID. KO:132) 1 9 

(SEQ ID. NO:133) 2 9 

(SEQ ID. NO:134) 3 9 

(SEQ ID. N*ai35) 4 9 

(SEQ ID. N0:136) 5 9 

(SEQ ID. NO:137) 6 9 

(SEQ ID. N0:138) 7 9 

(SEQ ID. NO:139) 8 9 

(SEQ ID. NO:140) 9 9 

(SEQ ID. NO:141) 10 9 

(SEQ ID. NO:142) 11 9 

(SEQ ID. NO:143) 12 9 

(SEQ ID. NO:144) 13 9 

(SEQ ID. NO:145) 14 9 

(SEQ ID. NO:146) 15 9 

(SEQ ID. NO:147) 16 9 

(SEQ ID. NOa48) 0 10 

(SEQ ID. NO:149) 1 10 

(SEQ ID. NO:150) 2 10 

(SEQ ID. NO:151) 3 10 

(SEQ ID. NO:152) 4 10 

(SEQ ID. NO:153) 5 10 

(SEQ ID. N0:154) 6 10 

(SEQ ID. NO:155) 7 10 

(SEQ ID. NO:l56) 8 10 

(SEQ ID. NO:203) 11 13 

(SEQ ID. NO504) 12 13 

(SEQ ID. NO505) 13 13 

(SEQ ID. NO:206) 14 13 

(SEQ ID. NO:207) 15 13 

(SEQ ID. NO208) 16 13 

(SEQ ID. NO:209) 5 14 

(SEQ ID. NO210) 6 14 

(SEQ ID. N0211) 7 14 

(SEQ ID. N0212) 8 14 

(SEQ ID. N0213) 9 14 

(SEQ ID. N0214) 10 14 

(SEQ ID. NO:215) 11 14 

(SEQ ID. N0216) 12 14 

(SEQ ID. N0217) 13 14 

(SEQ ID. N0218) 14 14 

£EQID.N0219) 15 14 

(SEQ ID. KO220) 16 14 

(SEQ ID. NO:221) 5 15 

(SEQ ID. NO:222) 6 15 

(SEQ ID. NO:223) 7 15 

(SEQ ID. NO-.224) 8 15 

(SEQ ID. NO-.225) 9 15 

(SEQ ID. N0^26) 10 15 

(SEQ ID. N0227) 11 15 

(SEQ ID. NO:228) 12 15 

(SEQ ID. N0229) 13 15 

(SEQ ID. NO:230) 14 15 

(SEQ ID. N0231) 15 15 

(SEQ ID. N0232) 16 15 



DL3TTGTTTCTTGGG 

DIJTCTTGGGATTGTG 

DL3TGTATGAATGATTT 

DL3TGATTTCACACAA 

DUCTCTGOGACCTC 

DL3GACCTCGGCCT 

DL3TCGGOCTCGTG 

D L3GATG AAGTCCCAG 

DL3AGTCOCAGTATTT 

D L3GTATTTCGG ATTT 

DL3TCGGATTTATCG 

DL3GATTTATOGGGT 

D L3 ATCGGGTGTGCA 

DL3TGTGCAAGGGGA 

DL3CAAGGGGAATTT 

D L3G AATTTATTCTGTA 

DUTCrGTAGTGCTAC 

D L3GTAGTGCTACCT 

DL3GCTACCTAGTAG 

DL3CTAGTAGTCCAGA 

D L3TCCAO ATA9TGGG 

DLJAGATAGTGGGATA 

D L3GGG ATAATTGGT 

DL3TAATTGGTGAGTG 

DL3TATAGGGCGTGT 

DUGGGCGTGTTCTCA 

DUGTGTTCTCACGAT 

DL3TCACGATGAGAGG 

D UATGAGAGG AGCG 

DL3AGGAGCGAGGC 

DUCGAGGCCCGG 

DL3GCCCGGGTATT 

DL3CGGGTATTGTGA 

DL3GTGAACCCCCAT 

DL3CCCCATCGATTT 

DL3ATCGATTTCACTT 

D 1JTTTCACTTG AC AT . 

DL3TTGACATAGAGCT 

DL3TAGAGCTGTAGAC 

D L3GTAG ACCAAGG A 

DL3ACCAAGGATGAAG 

D L3CGTGTAATGTC AG 

DL3TGTCAGTTTAGGG 

D L3TCAGTTTAGGGA 

DL3TAGGGAAGAGCA 

DL3AAGAGCAGGGGT 

DL3CAGGGGTACCTA 

D UGGTACCTACTGG 

DL3TACTGGGGGGA 

D L3GGGGG AGTCTAT 

DL3CATGTATTTTTGG 

DL3TTTTGGGTTAGG 

D L3GGGTTAGG ATGT 

DUGGATGTAGTTTTG 

DL3TGTAGTTTTGGG 

DL3TTTGGGGGAGG 

DUGGGITCATAACrG 

D U ATAACTG AGTGGG 

DL3AACTGAGTGGGT 

DUGTGGGTAGTTGT 

DL3GTAGTTGTTGGC 

DL3GTTGGCGATACA 

DUCGATACATAAAAG 

DL3TAAAAGCATGTAA 

DUGCATGTAATGACG 

DL3ATGAOGGTCGGT 

DL3GTCGGTGGTACT 

DL3G0TACTTATAACA 

D L3TCGATTCTAAG AT 

DL3TAAGATTAAATTT 

D L3 AAATTTG AATAAG 

DL3AATAAGAGACAAG 

D L3AAGAG ACAAG AAA 

D L3 AAGAAAGTACCC 

DL3AAAGTACCCCTT 

DUCCCCTTCGTCTA 

DL3CTTCGTCTAAAC 

DL3CTAAACCCATGG 

DL3AACCCATGGTGG 

D1JTGGTGGGTTCAT 



(SEQ ID. NO:107) 
(SEQ ID. NO:103) 
(SEQ ID. NO:109) 
(SEQ ID. NO:110) 
(SEQ ID. NO:157) 
(SEQ ID. NO:158) 
(SEQ ID. NO:159) 
(SEQ ID. NO:160) 
(SEQ ID. NO:l6l) 
(SEQ ID. NO:162) 
(SEQ ID. NO:163) 
(SEQ ID. NO:164) 
(SEQ ID. NO:165) 
(SEQ CD. NO:l 66) 
(SEQ ID. NO:167) 
(SEQ ID. NO:168) 
(SEQ ID. NO:169) 
(SEQ ID. NO:I70) 
(SEQ ID. NO:171) 
(SEQ ID. NO:172) 
(SEQ CD. NO:l73) 
(SEQ ID. NO:l74) 
(SEQ ID. NO:175) 
(SEQ ID. NO:176) 
(SEQ ID. NO:177) 
(SEQ ID. N0:178) 
(SEQ ID. NO:179) 
(SEQ ID. NO:180) 
(SEQ ID. NO:181) 
(SEQ ID. NO:l82) 
(SEQ CD. NO:183) 
(SEQ ID. NO:184) 
(SEQ ID. NO:185) 
(SEQ ID. NO:186) 
(SEQ ID. NO:187) 
(SEQ ID. NO:18S) 
(SEQ ID. NO:189) 
(SEQ ID. NO:190) 
(SEQ ID. NO:191) 
(SEQ CD. NO:192) 
(SEQ ID. NO:193) 
(SEQ ID. NO:194) 
(SEQ ID. NO:195) 
(SEQ ID. NO:196) 
(SEQ ID. NO:197) 
(SEQ ID. NO:198) 
(SEQ ID. NO:199) 
(SEQ CD. NO:200) 
(SEQ ID. NO:201) 
(SEQ CD. NO:202) 
(SEQ ID. NO:246) 
(SEQ ID. N0247) 
(SEQ ID. NO:248) 
(SEQ ID. NO-.249) 
(SEQ ID. NO:250) 
(SEQ ID. N0^51) 
(SEQ ID. NO:252) 
(SEQ ID, N0^53) 
(SEQ ID. NO-.254) 
(SEQ CD. NO:255) 
(SEQ ID. NO:256) 
(SEQ ID. N0257) 
(SEQ ID. NO:258) 
(SEQ ID. NO:259) 
(SEQ ID. NO-^60) 
(SEQ ID. NO:261) 
(SEQ ID. NOi62) 
(SEQ ID. N0263) 
(SEQ ID. N0264) 
(SEQ ED. NO:265) 
(SEQ ID. N0^66) 
(SEQ ID. N0^67) 
(SEQ ID. NO:268) 
(SEQ ID. N0269) 
(SEQ ID. KO^70) 
(SEQ ID. NO:27l) 
(SEQ ID. NO:272) 
(SEQ ID. NO:273) 
(SEQ ID. NOS74) 
(SEQ ID. NO-^75) 
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-continued 



10 12 DL3GTCGT1AGTTGG (SEQ ID. NO:233) 5 16 

11 12 DL3TAGTTCGGAGTT (SEQ ID. N"0:234) 6 16 

12 12 DL3GGAGTTOATAGTG (SEQ ID. NO:235) 7 16 

13 12 DUATAGTGTGTAGTT (SEQ ID. NO:236) 8 16 

14 12 DL3GTFIAGTTGACGT (SEQ ID. SO03T) 9 16 

15 12 • DL3TG ACGTTGAGGT (SEQ ID. N0238) 10 16 

16 12 DL30GTTGAGGnTA (SEQ ID. NO:239) 11 16 

5 13 DUTATAACATGCCAT (SEQ ID. NO240) 12 16 

6 13 DL3AACATGCCATGGT (SEQ ID. KO:241) 13 16 

7 13 DUCCATGGTATTAT (SEQ ID. NO:242) 14 16 

8 13 DL3ATTTATGAACTGG (SEQ ID. N0243) 15 16 

9 13 DUAACTGGTGGACAT (SEQ ID. N0^44) 16 16 

10 13 DL3TGGACATCATGTA (SEQ ID. N0245) 



DLJTTGGAAAAAGGT (SEQ ID. NO:27d) 

DL3AAAAGGTTCCTG (SEQ ID. NO:277) 

DL3GGTTCCTG1TTA (SEQ ID. NO:278) 

DL3CCTGnTAGTCrC (SEQ ID. NO:279) 

DUTTAGTClOTnT (SEQ ID. NO:280) 

DL3CTTTTTCAGAAAT (SEQ ID. NO:281) 

D L3 AGAAATTG AGGTG (SEQ ID. NO:282) 

DL3AAATTGAGGTGGT (SEQ ID. NO:283) 

DL3GGTGGTAATCGT (SEQ ID. NO:284) 

DL3TAATCGTGGGTT (SEQ ID. NO:285) 

DL3GTGGGTITCGAT (SEQ ED. NO:286) 

DL3GGTTTCGATTCT (SEQ ID. NO:287) 



No probes were present in positions X, Y-0, 12 to X, Y-4, 12; X, 13 to 
X, Y-4, 13; X, Y-0, 14 to X, Y-4, 14; X, Y-0, 15 to X, Y-4, 15; X, Y-0, 
16 to X, Y-4, 16. The length of each of the probes on the chip was 
variable to minimize differences in melting temperature and potential 
for cross-hybridization. Each position in the sequence is represented 
by at least one probe and most positions are represented by 2 or more 
probes. As noted above, the amount of overlap between the 
oligonucleotides varies from probe to probe. 

FIG. 9 shows the human mitochondrial genome; "OH" is the H strand 
origin of replication, and arrows indicate the cloned unshaded 
sequence. 

DNA was prepared from hair roots of six human donors (mt1 to mt6) and 
then amplified by PCR and cloned into M13; the resulting clones were 
sequenced using chain terminators to verify that the desired specific 
sequences were present. DNA from the sequenced M13 clones was amplified 
by PCR, transcribed in vitro, and labeled with fluorescein-UTP using T3 
RNA polymerase. [illegible] were fragmented and hybridized to the chip. 
The results showed that each different individual had DNA that produced 
a unique hybridization fingerprint on the chip and that the differences 
in the observed patterns could be correlated with differences in the 
cloned genomic DNA sequence. The results also demonstrated that very 
long sequences of a target nucleic acid can be represented 
comprehensively as a specific set of overlapping oligonucleotides and 
that arrays of such probe sets can be usefully applied to genetic 
analysis. 

The sample nucleic acid was hybridized to the chip in a solution 
composed of 6xSSPE, 0.1% Triton X-100 for 60 minutes at 15° C. The chip 
was then scanned by confocal scanning fluorescence microscopy. The 
individual features on the chip were 588x588 microns, but the lower 
left 5x5 square features in the array did not contain probes. To 
quantitate the data, pixel [illegible] synthesis site. Pixels represent 
50x50 microns. The fluorescence intensity for each feature was scaled 
to a mean determined from 27 bright features. After scanning, the chip 
was stripped and rehybridized; all six samples were hybridized to the 
same chip. FIG. 10 shows the image observed from the mt4 sample on the 
DNA chip. FIG. 11 shows the image observed from the mt5 sample on the 
DNA chip. FIG. 12 shows the predicted difference image between the mt4 
and mt5 samples on the DNA chip based on mismatches between the two 
samples and the reference sequence (see Anderson et al., 1981, Nature 
290: 457-465, incorporated herein by reference). FIG. 13 shows the 
actual difference image observed. 

The results show that, in almost all cases, mismatched probe/target 
hybrids resulted in lower fluorescence intensity than perfectly matched 
hybrids. Nonetheless, some probes detected mutations (or specific 
sequences) better than others, and in several cases, the differences 
were within noise levels. Improvements can be realized by increasing 
the amount of overlap between probes and hence overall probe density 
and, for duplex DNA targets, using a second set of probes, either on 
the same or a separate chip, corresponding to the second strand of the 
target. FIG. 14, in sheets 1 and 2, shows a plot of normalized 
intensities across rows 10 and 11 of the array and a tabulation of the 
mutations detected. 

FIG. 15 shows the discrimination between wild-type and mutant hybrids 
obtained with this chip. The median of the six normalized hybridization 
scores for each probe was taken. The graph plots the ratio of the 
median score to the normalized hybridization score versus mean counts. 
On this graph, a ratio of 1.6 and mean counts above 50 yield no false 
positives, and while it is clear that detection of some mutants can be 
improved, excellent discrimination is achieved considering the small 
size of the array. FIG. 16 illustrates how the identity of the base 
mismatch may influence the ability to discriminate mutant and wild-type 
sequences more than [illegible] of the mismatch [illegible] an 
oligonucleotide [illegible]. The mismatch position is expressed as % of 
probe [illegible] base change [illegible] on the [illegible]. 
[illegible] results show that the DNA chip increases the capacity of 
the standard reverse dot blot format by orders of magnitude, extending 
the power of that approach many fold, and that the methods of the 
invention are more efficient and easier to automate than gel-based 
methods of nucleic acid sequence and mutation analysis. 

These advantages become more apparent as chips with more and more 
probes are employed. To illustrate, the present invention provides a 
DNA chip for analyzing human mitochondrial DNA (mtDNA) that "tiles" 
through 648 nucleotides of human H strand mtDNA from positions 16280 to 
356. The probes in the array are 15 nucleotides in length and each 
position in the target sequence is represented by a set of 4 probes 
[illegible] of mtDNA sequence. The [illegible] blank rows. 4 corner 
columns [illegible]; there are a total of [illegible] probes 
[illegible] (feature), and each area is 256x197 microns. 

Labeled RNA target DNA was prepared by PCR amplification of a 1.3 kb 
region of human mtDNA spanning positions 15935 to 667, cloning into M13 
(sequence verification was performed), and reamplification of the 
cloned sequences using primers tagged with T3 and T7 RNA polymerase 
promoter sequences and in vitro transcription to produce 
fluorescein-UTP labeled RNA. The RNA was fragmented and hybridized to 
the oligonucleotide array in a solution composed of 6xSSPE, 0.1% Triton 
X-100 for 60 minutes at 18° C. Unhybridized material was washed away 
with buffer, and the chip was scanned at 25 micron pixel resolution. 
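
The tiling design described for the mtDNA chip (15-nucleotide probes, with each target position represented by a set of 4 probes) can be sketched in Python. This is a minimal illustration, not the layout actually used on the chip: the function name `tiling_probes`, the choice of the central probe position as the interrogated base, the treatment of the target as a linear string, and the convention of writing each probe as the base-by-base complement of the target window are all assumptions made for the example.

```python
# Hypothetical sketch of a 4-probe-per-position tiling set.
# Assumptions (not taken from the specification): the interrogated base sits
# at the centre of the probe, the target is linear, and probes are written
# as the base-by-base complement of the target window.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tiling_probes(target, probe_len=15, query_offset=7):
    """Return a list of (target_index, {base: probe}) tuples.

    For each window of `probe_len` bases, four probes are generated that
    differ only at `query_offset`; exactly one of the four is perfectly
    complementary to the target window.
    """
    sets = []
    for start in range(len(target) - probe_len + 1):
        window = target[start:start + probe_len]
        comp = [COMPLEMENT[b] for b in window]
        probes = {}
        for base in "ACGT":
            variant = comp.copy()
            variant[query_offset] = base  # substitute the interrogated base
            probes[base] = "".join(variant)
        sets.append((start + query_offset, probes))
    return sets


if __name__ == "__main__":
    demo = tiling_probes("ACGTACGTACGTACGTACG")
    for index, probes in demo[:2]:
        print(index, probes)
```

Under these assumptions a target of length L yields L - 14 probe sets, so a real design would also need a rule for the bases near each end of the target, which this sketch leaves uncovered.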
FIG. 17 provides a 5' to 3' sequence listing of one target 
corresponding to the probes on the chip. X is a control probe. 
Positions that differ in the target (i.e., are mismatched with 
the probe at the designated site) are in bold. FIG. 18 shows 
the fluorescence image produced by scanning the chip when 
hybridized to this sample. About 95% of the sequence could 
be read correctly from only one strand of the original duplex 
target nucleic acid. Although some probes did not provide 
excellent discrimination and some probes did not appear to 
hybridize to the target efficiently, excellent results were 
achieved. The target sequence differed from the probe set at 
six positions: 4 transitions and 2 insertions. All 4 transitions 
were detected, and specific probes could readily be incor- 
porated into the array to detect insertions or deletions. FIG. 
19 illustrates the detection of 4 transitions in the target 
sequence relative to the wild-type probes on the chip. 
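
Reading a sequence from such a chip amounts to comparing, at each target position, the signals of the probes that differ only in the interrogated base. The following Python sketch shows one way this could be done, assuming four probes (A, C, G, T) per position. The function names, the `min_ratio` threshold of 1.2, and the use of "N" for an ambiguous call are illustrative assumptions, not values from the specification.

```python
# Hypothetical base-calling sketch for a 4-probe-per-position array.
# The threshold and the "N" convention are assumptions for illustration.

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def call_base(intensities, min_ratio=1.2):
    """Call the base whose probe is brightest.

    `intensities` maps each of "A", "C", "G", "T" to a signal value.
    Returns "N" when the top two signals are too close to separate.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if top <= 0 or top < min_ratio * second:
        return "N"
    return best


def compare_to_reference(reference, calls):
    """List (position, reference_base, called_base, kind) for each difference."""
    report = []
    for pos, (ref, call) in enumerate(zip(reference, calls)):
        if call == "N" or call == ref:
            continue
        kind = "transition" if (ref, call) in TRANSITIONS else "transversion"
        report.append((pos, ref, call, kind))
    return report


if __name__ == "__main__":
    signals = [
        {"A": 100, "C": 10, "G": 12, "T": 8},
        {"A": 9, "C": 11, "G": 7, "T": 95},
    ]
    calls = [call_base(s) for s in signals]
    print(calls, compare_to_reference("AC", calls))
```

Insertions and deletions are not handled here; as the text notes, detecting them would require specific additional probes in the array.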

These results illustrate that longer sequences can be read 
using the DNA chips and methods of the invention, as 
compared to conventional sequencing methods, where read- 
ing length is limited by the resolution of gel electrophoresis. 
Similar results were observed when genomic DNA samples 
were prepared from human hair roots. Hybridization and 
signal detection require less than an hour and can be readily 
shortened by appropriate choice of buffers, temperatures, 
probes, and reagents. In principle, longer sequence reads can 
be obtained than by conventional sequencing, where reading 
length is limited by the resolution of gel electrophoresis. 
P53 Sequencing and Diagnostic DNA Chips 

P53 is a tumor suppressor gene that has been found to be mutated in 
most forms of cancer (see Levine et al., 1991, Nature 351: 453-456, and 
Hollstein et al., 1991, Science 253: 49-53, each of which is 
incorporated herein by reference). In addition, there is a hereditary 
syndrome, Li-Fraumeni, in which individuals inherit mutant alleles of 
p53 and tend to have cancer at relatively young ages (Frebourg et al., 
1992, PNAS 89: 6413-6417, incorporated herein by reference). During the 
development of a cancer, p53 is inactivated. The course of p53 
inactivation generally involves a mutation in one copy of p53 and is 
often followed by deletion of the other copy. After p53 is inactivated, 
chromosomal abnormalities begin to appear in tumors. In the best 
understood form of cancer, colorectal cancer, well over 50%, perhaps 
80%, of all patients with tumors have p53 mutations. In addition, p53 
mutations have been found in a high proportion of lung, breast, and 
other tumors (Rodrigues et al., 1990, PNAS 87: 7555-7559, incorporated 
herein by reference). According to data presented by David Sidransky 
(1992 San Diego Conference), over 400 mutations in p53 are known. 

The p53 gene spans 20 kbp in humans and has 11 exons, 10 of which are 
protein coding (see Tominaga et al., 1992, Critical Reviews in 
Oncogenesis 3: 257-282, incorporated herein by reference). The gene 
produces a 53 kilodalton phosphoprotein that regulates DNA replication. 
The protein acts to halt replication at the G1/S boundary in the cell 
cycle and is believed to act as a "molecular policeman," shutting down 
replication when the DNA is damaged or blocking the reproduction of DNA 
viruses (see Lane, 1992, Nature 358: 15-16, incorporated herein by 
reference). There is substantial interest in the cancer research 
community in analyzing p53 mutations. The NCI is currently funding 
contracts to characterize the p53 mutation spectra caused by various 
carcinogens. In addition, there are research projects which involve 
sequencing p53 from spontaneously arising tumors. A major resource in 
these studies is the huge supply of biopsy material stored in paraffin 
blocks. Also, there are projects which are aimed at analyzing the 
relationship between the particular mutation in p53 and the functioning 
of the resulting protein. Furthermore, there are projects looking at 
the germline inheritance of p53 mutations and the development of 
cancer. The present invention provides useful DNA chips and methods for 
such studies. 

In addition, the present invention also provides a diagnostic test kit 
and method and p53 probes immobilized on a DNA chip in an organized 
array. Currently available diagnostic tests for cancer typically have a 
sensitivity of about 50%. The present invention provides significant 
advantages over such tests, and in one embodiment provides a method for 
detecting cancer-causing mutations in p53 that involves the steps of 
(1) obtaining a biopsy, which is optionally fractionated by cryostat 
sectioning to enrich tumor cells to about 80% of the total cell 
population. The DNA or RNA is then extracted, amplified, and analyzed 
with a DNA chip for the presence of p53 mutations correlated with 
malignancy. 

To illustrate the value of the DNA chips of the present invention in 
such a method, a DNA chip was synthesized by the VLSIPS™ method to 
provide an array of overlapping probes which represent or tile across a 
60 base region of exon 6 of the p53 gene. To demonstrate the ability to 
detect substitution mutations in the target, twelve different single 
substitution mutations (wild type and three different substitutions at 
each of three positions) were represented on the chip along with the 
wild type. Each of these mutations was represented by a series of 
twelve 12-mer oligonucleotide probes, which were complementary to the 
wild type target except at the one substituted base. Each of the twelve 
probes was complementary to a different region of the target and 
contained the mutated base at a different position (e.g., if the 
substitution was at base 32, the set of probes would be complementary, 
with the exception of base 32, to regions of the target 21-32, 22-33, 
and 32-43). This enabled investigation of the effect of the 
substitution position within the probe. The alignment of some of the 
probes with a 12-mer model target nucleic acid is shown in FIG. 20. 

To demonstrate the effect of probe length, an additional series of ten 
10-mer probes was included for each mutation (see FIG. 21). In the 
vicinity of the substituted positions, the wild-type sequence was 
represented by every possible overlapping 12-mer and 10-mer probe. To 
simplify comparisons, the probes corresponding to each varied position 
were arranged on the chip in the rectangular regions with the following 
structure: each row of cells represents one substitution, with the top 
row representing the wild type. Each column contains probes 
complementary to the same region of the target, with probes 
complementary to the 3'-end of the target on the left and probes 
complementary to the 5'-end of the target on the right. The difference 
between two adjacent columns is a single base shift in the positioning 
of the probes. Whenever possible, the series of 10-mer probes were 
placed in four rows immediately underneath and aligned with the 4 rows 
of 12-mer probes for the same mutation. 

To provide model targets, 5' fluoresceinated 12-mers containing all 
possible substitutions in the first position of codon 192 were 
synthesized (see the starred position in the target in FIG. 20). 
Solutions containing 10 nM target DNA in 6xSSPE, 0.25% Triton X-100 
were hybridized to the chip at room temperature for several hours. 
While target nucleic acid was hybridized to the chip, the fluorophores 
on the chip were excited by light from an argon laser, and the chip was 
scanned with an autofocusing confocal microscope. The emitted signals 
were processed by a PC to produce an image using image analysis 
software. By 1 to 3 hours, the signal had reached a plateau; to remove 
the hybridized target and allow hybridization to another target, the 
chip was stripped with 60% formamide, 2xSSPE at 17° C. for 5 minutes. 
The washing buffer and temperature can vary, but the buffer typically 
contains 2-to-3xSSPE, 10-to-60% formamide (one can use multiple washes, 
increasing the formamide concentration by 10% each wash, and scanning 
between washes to determine when the wash is complete), and optionally 
a small percentage of Triton X-100, and the temperature is typically in 
the range of 15° to 18° C. 
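
The probe-set layout used on the p53 exon 6 chip, in which a substituted base is covered by every probe-length window that contains it, can be enumerated with a few lines of Python. This is only an illustrative sketch: the function name `probe_windows` and the 1-based inclusive coordinate convention are assumptions chosen to match the "21-32, 22-33, and 32-43" example in the text.

```python
# Hypothetical sketch: enumerate the target regions covered by a probe set
# that places one substituted base at every possible position in the probe.
# Coordinates are 1-based and inclusive, matching the example in the text.


def probe_windows(sub_pos, probe_len=12):
    """Return (start, end, position_in_probe) for each window containing sub_pos."""
    windows = []
    for start in range(sub_pos - probe_len + 1, sub_pos + 1):
        end = start + probe_len - 1
        windows.append((start, end, sub_pos - start + 1))
    return windows


if __name__ == "__main__":
    for start, end, offset in probe_windows(32):
        print(f"target {start}-{end}: substituted base at probe position {offset}")
```

With `probe_len=12` the function yields twelve windows for base 32, from 21-32 through 32-43, and with `probe_len=10` it yields ten, consistent with the twelve 12-mer and ten 10-mer probes per mutation described above.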

Very distinct patterns were observed after hybridization with targets 
with 1 base substitutions and visualization with a confocal microscope 
and software analysis, as shown in FIG. 22. In general, the probes 
which form perfect matches with the target retain the highest signal. 
For example, in the first image in FIG. 22, the 12-mer probes that form 
perfect matches with the wild-type (WT) target are in the first row 
(top). The 12-mer probes with single base mismatches are located in the 
second, third, and fourth rows and have much lower signals. The data is 
also depicted graphically in FIG. 23. On each graph, the X ordinate is 
the position of the probe in its row on the chip, and the Y ordinate is 
the signal at that probe site after hybridization. 

When a target with a different one base substitution is hybridized, the 
complementary set of probes has the highest signal (see pictures 2, 3, 
and 4 in FIG. 22 and graphs 2, 3, and 4 in FIG. 23). In each case, the 
probe set with no mismatches with the target has the highest signals. 
Within a 12-mer probe set, the signal was highest at position 6 or 7. 
The graphs show that the signal difference between 12-mer probes at the 
same X ordinate tended to be greatest at positions 5 and 8 when the 
target and the complementary probes formed 10 base pairs and 11 base 
pairs, respectively. Because tumors often have both WT and mutant p53 
genes, mixed target populations were also hybridized to the chip, as 
shown in FIG. 24. When the hybridization solution consisted of a 1:1 
mixture of WT 12-mer and a 12-mer with a substitution in position 7 of 
the target, the sets of probes that were perfectly matched to both 
targets showed higher signals than the other probe sets. 

The hybridization efficiency of a 10-mer probe array was also compared 
with that of a 12-mer probe array. The 10-mer and 12-mer probe arrays 
gave comparable signals (see graphs 1-4 in FIG. 23 and graphs 1-4 in 
FIG. 25). However, the 10-mer probe sets, which are in rows 5-8 (see 
images in FIG. 22), seemed to be better in this model system than the 
12-mer probe sets at resolving one target from another, consistent with 
the expectation that one base mismatches are more destabilizing for 
10-mers than 12-mers. Hybridization results within probe sets perfectly 
matched to target also followed the expectation that, the more matches 
the individual probe formed with the target, the higher the signal. 
However, duplexes with two 3' dangles (see FIG. 23, position 6 in 
graphs 1-4) have about as much signal as the probes which are matched 
along their entire length (see FIG. 23, position 7, in graphs 1-4). 

This illustrative model system shows that 12-mer targets that differ by 
one base substitutions can be readily distinguished from one another by 
the novel probe array provided by the invention and that resolution of 
the different 12-mer targets was somewhat better with the 10-mer probe 
sets than with the 12-mer probe sets. The value of having several 
overlapping probes hybridizing to a target demonstrates the value of 
the multiple hybridization events that take place on a DNA chip of the 
invention. The results also demonstrate the feasibility of constructing 
a probe set to sequence the entire 1.4 kbp protein coding region of p53 
or alternatively the 0.6 kbp of exons 5-9 containing mutation hot 
spots. 

For sequencing, the p53 DNA can be cloned from the sample or directly 
amplified from genomic DNA by PCR. If genomic PCR is used, then the DNA 
can be diluted prior to amplification so that a single copy of the gene 
is amplified. For diagnostic purposes, the genomic DNA can be isolated 
from a tumor biopsy in which the tumor cells may be the majority 
population. As noted above, the proportion of tumor cells in a sample 
can be enriched by cryostat sectioning. DNA can also be isolated and 
amplified from tumor samples stored in paraffin blocks. 

The p53 DNA in the sample can be amplified by PCR (although other 
amplification methods can be used) using 3-4 primer pairs generating 
amplicons of <3 kbp each. Illustrative primers of the invention for 
amplifying exon 5 of the p53 gene are shown below (B is biotin; F is 
fluorescein). 

5'-B-CACTTGTGCCCTGACTTTCAAC-3' (SEQ. ID NO:288) 

5'-F-CACTTGTGCCCTGACTTTCAAC-3' 

5'-ATGCAATTAACCCTCACTAAAGGGAGACACTTGTGCCCTGACTTTCAAC-3' 
(SEQ. ID NO:289) (has T3 promoter) 

5'-B-GACCCTGGGCAACCAGCCCTGTCGT-3' (SEQ. ID NO:290) 

5'-F-GACCCTGGGCAACCAGCCCTGTCGT-3' 

5'-TAATACGACTCACTATAGGG[illegible]ACCAGCCCTGTCGT-3' 
(SEQ. ID NO:291) (has T3 promoter) 

After PCR amplification of the target (the amplified target is called 
the "amplicon") one strand of the amplicon can then be isolated, i.e., 
using a biotinylated primer that allows capture of the undesired strand 
on streptavidin beads. Alternatively, asymmetric PCR can be used to 
generate a single-stranded target. Another approach involves the 
generation of single stranded RNA from the PCR product by incorporating 
a T7 or other RNA polymerase promoter in one of the primers. The 
single-stranded material can optionally be fragmented to generate 
smaller nucleic acids with less significant secondary structure than 
longer nucleic acids. 

In one such method, fragmentation is combined with labeling. To 
illustrate, degenerate 8-mers or other degenerate short 
oligonucleotides are hybridized to the single-stranded target material. 
In the next step, a DNA polymerase is added with the four different 
dideoxynucleotides, each labeled with a different fluorophore. 
Fluorophore-labeled dideoxynucleotides are available from a variety of 
commercial suppliers, such as ABI. Hybridized 8-mers are extended by a 
labeled dideoxynucleotide. After an optional purification step, i.e., 
with a size exclusion column, the labeled 9-mers are hybridized to the 
chip. Other methods of target fragmentation can be employed. The 
single-stranded DNA can be fragmented by partial degradation with a 
DNAse or partial depurination with acid. Labeling can be accomplished 
in a separate step, i.e., fluorophore-labeled nucleotides are 
incorporated before the fragmentation step or a DNA binding 
fluorophore, such as ethidium homodimer, is attached to the target 
after fragmentation. 

In one embodiment, the DNA chip has an array of 10 to 10* probes tiling 
across the protein coding regions of p53, which comprise about 1200 bp; 
smaller arrays specific for the 600 bp mutational hot spot region are 
also useful. The probes overlap for N-2 to N-4 bases, where N is the 
length of the probe in bases. N is typically 10 to 14 bases long, but 
as will be seen below, probes 15 to 19 bases and longer are also 
useful. Every possible single base substitution occurring one at a time 
is represented in the array. The number of unique 10-mer probes with 7 
base overlaps would be about 
(1200/3)x4x10 or about 1.6x10^4. To allow 3 replicates of each probe, 
one might have a total array size on the order of 4.8x10^4 probes. Of 
course, arrays of probes within the ranges of 10^2 to [illegible] 
probes are also useful for applications; for example, very large arrays 
of 10* or more probes are useful for sequencing or sequence checking 
large genomic DNA fragments. Optionally [illegible] nucleic acid 
hybridized to the chip is detected by a confocal microscope or other 
imaging device. The pattern of sites "lighting up" with target is 
preferably analyzed with computer assistance to provide the sequence of 
the target from the pattern of sites producing signals. 

The invention is illustrated below with examples of DNA chips 
comprising very large arrays of DNA probes to "resequence" p53 target 
nucleic acid in a sample. To analyze DNA from exon 5 of the p53 tumor 
suppressor gene, a set of overlapping 17-mer probes was synthesized on 
a chip. The probes for the WT allele were synthesized so as to tile 
across the entire exon with single base overlaps between probes. For 
each WT probe, a set of 4 additional probes, one for each possible base 
substitution at position 7, were synthesized and placed in a column 
relative to the WT probe. Exon 5 DNA was amplified by PCR with primers 
flanking the exon. One of the primers was labeled with fluorescein; the 
other primer was labeled with biotin. After amplification, the 
biotinylated strand was removed by binding to streptavidin beads. The 
fluoresceinated strand was used in hybridization. 

About [illegible] of the amplified, single-stranded nucleic acid was 
hybridized overnight in 5xSSPE at 60° C. to the probe chip (under a 
cover slip). After washing with 6xSSPE, the chip was scanned using 
confocal microscopy. FIG. 26 shows an image of the p53 chip hybridized 
to the target DNA. Analysis of the intensity data showed that 93.5% of 
the 184 bases of exon 5 were called in agreement with the WT sequence 
(see Buchman et al., 1988, Gene 70: 245-252, incorporated herein by 
reference). The miscalled bases were from positions where probe signal 
intensities were tied (1.6%) and where non-WT probes had the highest 
signal intensity (4.9%). FIG. 27 illustrates how the actual sequence 
was read. Gaps in the sequence of letters in the WT rows correspond to 
control probes or sites. Positions at which bases are miscalled are 
represented by letters in italic type in cells corresponding to probes 
in which the WT bases have been substituted by other bases. 

As the diagram indicates, the miscalled bases are from the low 
intensity areas of the image, which may be due to secondary structure 
in the target or probes preventing intermolecular hybridization. To 
diminish the effects due to secondary structure, one can employ shorter 
targets (i.e., by target fragmentation) or use more stringent 
hybridization conditions. In addition, the use of a set of probes 
synthesized by tiling across the other strand of a duplex target can 
also provide sequence information buried in secondary structure in the 
other strand. It should be appreciated, however, that the pattern of 
low intensity areas that forms as a result of secondary structure in 
the target itself provides a means to identify that a specific target 
sequence is present in a sample. Other factors that may contribute to 
lower signal intensities include differences in probe densities and 
hybridization stabilities. 

These results demonstrate the advantages provided by the DNA chips of 
the invention to genetic analysis. As another example, heterozygous 
mutations are currently sequenced by an arduous process involving 
cloning and repurification of DNA. The cloning step is required, 
because the gel sequencing systems are poor at resolving even a 1:1 
mixture of DNA. First, the target DNA is amplified by PCR with primers 
allowing easy ligation into a vector, which is taken up by 
transformation of E. coli, which in turn must be cultured, typically on 
plates overnight. After growth of the bacteria, DNA is purified in a 
procedure that typically takes about 2 hours; then, the sequencing 
reactions are performed, [illegible], and the samples are run on the 
gel for several hours, the duration depending on the [illegible] of the 
fragment to be sequenced. By contrast, the [illegible] provides direct 
analysis of the PCR amplified material after brief transcription and 
fragmentation steps, [illegible] days of time and labor. 

An interesting clinical application for the characterization of 
heterozygous mutations with DNA chips is as follows. Individuals with 
germline cancer mutations have a very high risk for tumors after 
treatment by irradiation. About 10% of all cancer patients may have 
germline mutations for p53 or other tumor suppressor genes. Thus, 
before deciding on a treatment modality, a physician could use the 
method and DNA chips of the invention to test for a germline suppressor 
gene mutation. 

DNA Chips for Rational Therapeutic Management 

The present invention also provides DNA chips that can be used by 
physicians to determine optimum therapeutic protocols by early, rapid 
detection of biologically mediated resistance to a therapeutic agent in 
a variety of disease states. The [illegible] of such DNA chips are 
many, as the chips will help physicians recognize health care cost 
savings, achieve rapid therapeutic benefits, limit administration of 
ineffective (due to the resistance) yet toxic drugs, monitor changes in 
pathogen resistance, and decrease pathogen acquisition of resistance. 
Important applications include the treatment of HIV, other infectious 
diseases, and cancer. 

HIV has infected a large and expanding number of people, resulting in 
massive health care expenditures. HIV can rapidly become resistant to 
drugs used to treat the infection, primarily due to the action of the 
heterodimeric protein (51 kDa and 66 kDa) HIV reverse transcriptase 
(RT) encoded by the 1.7 kb pol gene. The high error rate (5-10 per 
round) of the RT protein is believed to account for the hypermutability 
of HIV. The nucleoside analogues, i.e., AZT, ddI, ddC, and d4T, 
commonly used to treat HIV infection are converted to nucleotide 
analogues by sequential phosphorylation in the cytoplasm of infected 
cells, where incorporation of the analogue into the viral DNA results 
in termination of viral replication, because the 5'→3' phosphodiester 
linkage cannot be completed. However, after 6 months to 1 year of 
treatment, HIV typically mutates the RT gene so as to become incapable 
of incorporating the analogue and so resistant to treatment. Several 
known mutations are shown in tabular form below. 

RT mutations associated with drug resistance 

N.B. other mutations confer resistance to other drugs in vitro 

The present invention provides DNA chips for detecting the multiple 
mutations in the HIV RT gene associated with 
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resistance to different therapeutics. These DNA chips will 
enable physicians to monitor mutations over time and to 
change therapeutics if resistance develops. The DNA chip 
will provide redundant confirmation of conserved HIV RT 
and other gene sequences, and the probes on the chip will tile 
through, with overlap, in important mutational hot spot 
regions. The chip will optionally have probes that span the. 
entire coding region of the RT and optionally the genes for 
other HIV proteins, such as coat proteins. HIV target nucleic 



to gain primary structure information of the DNA target. 
This format has important applications in sequencing by 
hybridization, DNA diagnostics and in elucidating the ther- 
modynamic parameters affecting nucleic acid recognition. 

Conventional DNA sequencing technology is a laborious 
procedure requiring electrophoretic size separation of 
labeled DNA fragments. An alternative approach, termed 
Sequencing By Hybridization (SBH), has been proposed 
(LysovctaL, 198S,Dokl.Akad. NaukSSSR 303: 1508-1511; 



AMPLIFICATION OF TARGET 



TARGET 
SIZE 

1, 742bp 
535bp 
323bp 



GTAGAATTCTGTTCACTCAGATTGG 
(SEQ ID. K0292) 

AAATCCATACAATACTCCAGTATTTGC 

(SEQ ID. N029J) 
Genbank#K020l3 18S9-190S 



PRIMER 2 

OATAAGCTTGGGCCTrATCTATTCCAT 

(SEQ ID. NO:294) 

ACCCATCCAAAGGAATGGAGGTTtnTTC 

(SEQ ID. NO:295) 
bases 2211-2192 
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The HIV RTgene chips of the invention, as well as the CF, 
mtDNA, and p53 DNA chips of the invention, illustrate the 
diverse application of the methods and probe arrays of the 
invention. The examples that follow describe methods tor 
preparing nucleic acid targets from samples for application M 
to the DNA chips of the invention and provide additional 
details of the methods of the invention. 

EXAMPLES 

I. VLSIPS™ Technology 

As noted above, the VLSIPS™ technology is described in 35 
a number of patent publications and is preferred for making 
the oligonucleotide arrays of the invention. For 
completeness, a brief description of how this technology can 
be used to make and screen DNA chips is provided in this 
Example and the accompanying Figures. In the VLSIPS 40 
method, light is shone through a mask to activate functional 
(for oligonucleotides, typically an — OH) groups protected 
with a photoremovable protecting group on a surface of a 
solid support After light activation, a nucleoside building 
block, itself protected with a photoremovable protecting 4S 
group (at the 5' — OH), is coupled to the activated areas of 
the support. The process can be repeated, using different 
masks or mask orientations and building blocks, to prepare 
very dense arrays of many different oligonucleotide probes. 
The process is illustrated in FIG. 28; FIG. 29 illustrates how 50 
the process can be used to prepare "nucleoside combinato- 
rials" or oligonucleotides synthesized by coupling all four 
nucleosides to form dimers, trimers, etc. 

New methods for the combinatorial chemical synthesis of 
peptide, polycarbamate, and oligonucleotide arrays have 55 
recently been reported (see Fodor et al., 1991, Science 251: 
767-773; Cho et al, 1993, Science 261: 1303-1305; and 
Southern et al., 1992, Genomics 13: 1008-10017, each of 
which is incorporated herein by reference). These arrays, or 
biological chips (see Fodor et al., 1993, Nature 364: 60 
555-556, incorporated herein by reference), harbor specific 
chemical compounds at precise locations in a high-density, 
information rich format, and are a powerful tool for the 
study of biological recognition processes. A particularly 
exciting application of the array technology is in the field of 65 
DNAsequence analysis. The hybridization pattern of a DNA 
target to an array of shorter oligonucleotide probes is used 



oligonucleotide probes of defined sequence to search for 
complementary sequences on a longer target strand of DNA. 
The hybridization pattern is used to reconstruct the target 
DNA sequence. It is envisioned that hybridization analysis 
of large numbers of probes can be used to sequence long 
stretches of DNA. In immediate applications of this hybrid- 
ization methodology, a small number of probes can be used 
to interrogate local DNA sequence.

The strategy of SBH can be illustrated by the following
example. A 12-mer target DNA sequence,
AGCCTAGCTGAA (SEQ. ID NO:296), is mixed with a
complete set of octanucleotide probes. If only perfect
complementarity is considered, five of the 65,536 octamer
probes (TCGGATCG, CGGATCGA, GGATCGAC,
GATCGACT, and ATCGACTT) will hybridize to the target.
Alignment of the overlapping sequences from the hybridizing
probes reconstructs the complement of the original
12-mer target:

TCGGATCG
 CGGATCGA
  GGATCGAC
   GATCGACT
    ATCGACTT
TCGGATCGACTT (SEQ. ID NO:297)
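The worked example above can be reproduced with a short Python sketch. The function names `hybridizing_probes` and `assemble` are illustrative assumptions, and the assembly step assumes that every (k-1)-base overlap is unique, which holds for this 12-mer but not for targets containing repeats. Probes are written base-for-base opposite the target, following the convention used in the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def hybridizing_probes(target, k=8):
    """Return the k-mer probes that pair perfectly with the target."""
    comp = "".join(COMPLEMENT[b] for b in target)
    return [comp[i:i + k] for i in range(len(comp) - k + 1)]

def assemble(probes):
    """Chain probes by their (k-1)-base overlaps (assumes unique overlaps)."""
    by_prefix = {p[:-1]: p for p in probes}
    suffixes = {p[1:] for p in probes}
    # The first probe is the one whose prefix is not the suffix of any probe.
    current = next(p for p in probes if p[:-1] not in suffixes)
    sequence = current
    for _ in range(len(probes) - 1):
        nxt = by_prefix.get(current[1:])
        if nxt is None:
            break
        sequence += nxt[-1]
        current = nxt
    return sequence

probes = hybridizing_probes("AGCCTAGCTGAA")
print(probes)
print(assemble(probes))
```

The probe order passed to `assemble` does not matter, since the chain is rebuilt from the overlaps alone.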

Hybridization methodology can be carried out by attaching
target DNA to a surface. The target is interrogated with a set
of oligonucleotide probes, one at a time (see Strezoska et al.,
1991, Proc. Natl. Acad. Sci. USA 88: 10089-10093, and
Drmanac et al., 1993, Science 260: 1649-1652, each of
which is incorporated herein by reference). This approach
can be implemented with well established methods of immobilization
and hybridization detection, but involves a large
number of manipulations. For example, to probe a sequence
utilizing a full set of octanucleotides, tens of thousands of
hybridization reactions must be performed. Alternatively,
SBH can be carried out by attaching probes to a surface in
an array format where the identity of the probes at each site
is known. The target DNA is then added to the array of
probes. The hybridization pattern determined in a single
experiment directly reveals the identity of all complementary
probes.
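The contrast between the two formats can be made concrete with a small model of an addressed array. The sketch below is an assumption-laden illustration: the site numbering, `build_array`, and `read_array` are hypothetical, and only perfect complementarity is modeled. A single pass over the array stands in for the one hybridization experiment, whereas the probe-at-a-time format would need one reaction per probe.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def build_array(k):
    """Map each array site index to a probe of known sequence."""
    return {site: "".join(p) for site, p in enumerate(product("ACGT", repeat=k))}

def read_array(array, target):
    """Return the sites whose probes pair perfectly with the target."""
    k = len(array[0])
    comp = "".join(COMPLEMENT[b] for b in target)
    windows = {comp[i:i + k] for i in range(len(comp) - k + 1)}
    return sorted(site for site, probe in array.items() if probe in windows)

array = build_array(8)
positive_sites = read_array(array, "AGCCTAGCTGAA")
print(len(array), "probes on the array;", len(positive_sites), "sites hybridize")
print([array[s] for s in positive_sites])
```

Because the probe at each site is known in advance, the list of positive sites is equivalent to the list of complementary probes.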
As noted above, a preferred method of oligonucleotide
probe array synthesis involves the use of light to direct the
synthesis of oligonucleotide probes in high-density, miniaturized
arrays. Photolabile 5'-protected N-acyl-deoxynucleoside
phosphoramidites, surface linker
chemistry, and versatile combinatorial synthesis strategies
have been developed for this technology. Matrices of
spatially defined oligonucleotide probes have been
generated, and the ability to use these arrays to identify
complementary sequences has been demonstrated by
hybridizing fluorescent labeled oligonucleotides to the DNA
chips produced by the methods. The hybridization pattern
demonstrates a high [illegible]

[illegible] to FIG. 28. The surface of a solid
support modified with photolabile protecting groups (X) is
illuminated through a photolithographic mask, yielding
reactive hydroxyl groups in the illuminated regions. A
3'-O-phosphoramidite activated deoxynucleoside (protected
at the 5'-hydroxyl with a photolabile group) is then presented
to the surface [illegible]

Light-directed chemical synthesis lends itself to highly
efficient synthesis strategies which will generate a maximum
number of compounds in a minimum number of chemical
steps. For example, the complete set of 4^n polynucleotides
(length n), or any subset of this set, can be produced in only
4×n chemical steps. See FIG. 29. The patterns of illumination
and the order of chemical reactants ultimately define the
products and their locations. Because photolithography is
used, the process can be miniaturized to generate high-density
arrays of oligonucleotide probes. For an example of
the nomenclature useful for describing such arrays, an array
containing all possible octanucleotides of dA and dT is
written as (A+T)^8. Expansion of this polynomial reveals the
[illegible]. A DNA array composed of complete sets of
dinucleotides is referred to as having a complexity of 2. The
array given by (A+T+C+G)^8 is the full 65,536 octanucleotide
array [illegible]

[illegible] hybridization of DNA targets to the probe
arrays, the arrays are mounted in a thermostatically controlled
hybridization chamber. Fluorescein labeled DNA
targets are injected into the chamber and hybridization is
allowed to proceed for [illegible] to 2 hours. The surface of the
matrix is scanned in an epifluorescence microscope (Zeiss
Axioscop 20) equipped with photon counting electronics
using 50-100 µW of 488 nm excitation from an Argon ion
laser (Spectra Physics model 2020). All measurements are
acquired with the target [illegible]

When hybridizing a DNA target to an oligonucleotide
[illegible] of the probes will generate detectable signals. Modifying the
above expression for N, one arrives at a relationship estimating
the number of detectable hybridizations (Nd) for a
DNA target of length Lt and an array of complexity C.
Assuming an average of [illegible]
background: Nd = [illegible]

Arrays of oligonucleotides can be efficiently generated by
light-directed synthesis and can be used to determine the
identity of DNA target sequences. Because combinatorial
strategies are used, the number of compounds increases
exponentially while the number of chemical coupling cycles
increases only linearly. [illegible]
be implemented to generate arrays of any desired composition.
For example, because the entire set of dodecamers (4^12)
can be produced in 48 [illegible] ([illegible]
compounds requires b×n cycles), any subset of the dodecamers
(including any subset of shorter oligonucleotides) can be
constructed with the correct lithographic mask design in 48
or fewer chemical coupling steps. [illegible]
requires only a 2.5 [illegible] cm array.

Genome sequencing projects will ultimately be limited by
DNA sequencing technologies. Current sequencing methodologies
are highly reliant on complex procedures and require
substantial manual effort. Sequencing by hybridization has
the potential for transforming many of the manual efforts
into more efficient and automated formats. Light-directed
synthesis is an efficient means for large scale production of
miniaturized arrays for SBH. The oligonucleotide arrays are
not limited to primary sequencing applications. Because
single base changes cause [illegible] hybridization
[illegible]ful means to check the accuracy of [illegible]
DNA sequence, or to scan for changes within a sequence, [illegible]
DNA results in the loss of eight complements, and generates
eight new complements. Matching of hybridization patterns
may be useful in resolving sequencing ambiguities from
standard gel techniques, or for rapidly detecting DNA mutational
events. The potentially very high information content
of light-directed oligonucleotide arrays will change genetic
diagnostic testing. Sequence [illegible]
thousands of different genes will be assayed simultaneously
instead of the current [illegible]
the rapid identification of a wide variety of pathogenic
organisms.

[illegible] of novel synthetic nucleoside analogs for
antisense or triple helix applications [illegible]
used routinely today for oligonucleotide synthesis. FIG. 30
shows the deprotection, coupling, and oxidation steps of a
solid phase DNA synthesis method. FIG. 31 shows an
illustrative synthesis route for the nucleoside building blocks
used in the method. FIG. 32 shows a preferred photoremovable
protecting group, MeNPOC, and how to prepare the
group in active form. The procedures described below show
how to prepare these reagents. The nucleoside building
blocks are 5'-MeNPOC-THYMIDINE-3'-OCEP;
5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYCYTIDINE-3'-OCEP;
5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYGUANOSINE-3'-OCEP;
and 5'-MeNPOC-N4-t-BUTYL PHENOXYACETYL-DEOXYADENOSINE-3'-OCEP.

A. Preparation of 4,5-methylenedioxy-2-nitroacetophenone



minimum volume of CH2Cl2 or THF (~175 ml) and then
precipitating it by slowly adding hexane (1000 ml) while
stirring (yield 51 g; 80% overall). It can also be recrystallized
(e.g., toluene-hexane), but this reduces the yield.
C. Preparation of 1-(4,5-methylenedioxy-2-nitrophenyl)ethyl
chloroformate (MeNPOC-Cl)



[Reaction scheme: COCl2, toluene/THF]
A solution of 50 g (0.305 mole) 3,4-methylenedioxyacetophenone
(Aldrich) in 200 mL glacial
acetic acid was added dropwise over 30 minutes to 700 mL
of cold (2-4° C.) 70% HNO3 with stirring. (NOTE: the
reaction will overheat without external cooling from an ice
bath, which can be dangerous and lead to side products. At
temperatures below 0° C., however, the reaction can be
sluggish. A temperature of 3°-5° C. seems to be optimal.)
The mixture was left stirring for another 60 minutes at 3°-5°
C., and then allowed to approach ambient temperature.
Analysis by TLC (25% EtOAc in hexane) indicated complete
conversion of the starting material within 1-2 hr. When
the reaction was complete, the mixture was poured into ~3
liters of crushed ice, and the resulting yellow solid was
filtered off, washed with water and then suction-dried. Yield
~53 g (84%), used without further purification.
B. Preparation of 1-(4,5-methylenedioxy-2-nitrophenyl)ethanol
Phosgene (500 mL of 20% w/v in toluene from Fluka: 965
mmole; 4 eq.) was added slowly to a cold, stirring solution
of 50 g (237 mmole; 1 eq.) of 1-(4,5-methylenedioxy-2-nitrophenyl)ethanol
in 400 mL dry THF. The solution was
stirred overnight at ambient temperature, at which point TLC
(20% Et2O/hexane) indicated >95% conversion. The mixture
was evaporated (an oil-less pump with downstream
aqueous NaOH trap is recommended to remove the excess
phosgene) to afford a viscous brown oil. Purification was
effected by flash chromatography on a short (9x13 cm)
column of silica gel eluted with 20% Et2O/hexane. Typically
55 g (85%) of the solid yellow MeNPOC-Cl is obtained by
this procedure. The crude material has also been recrystallized
in 2-3 crops from 1:1 ether/hexane. On this scale, ~100
ml is used for the first crop, with a few percent THF added
to aid dissolution, and then cooling overnight at -20° C. (this
procedure has not been optimized). The product should be
stored desiccated at -20° C.
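The yields quoted for preparations A to C can be cross-checked from the stated masses. The sketch below is only a consistency check: the molecular weights are not given in the text and are computed here from the molecular formulas noted in the comments, so they should be treated as assumptions, as should the helper name `percent_yield`.

```python
# Molecular weights in g/mol, computed from the formulas in the comments
# (assumed values; the text itself does not list them).
MW = {
    "3,4-methylenedioxyacetophenone": 164.16,               # C9H8O3
    "4,5-methylenedioxy-2-nitroacetophenone": 209.16,       # C9H7NO5
    "1-(4,5-methylenedioxy-2-nitrophenyl)ethanol": 211.17,  # C9H9NO5
    "MeNPOC-Cl": 273.63,                                    # C10H8ClNO6
}

def percent_yield(start, grams_start, product, grams_product):
    """Molar yield of product relative to starting material, in percent."""
    moles_start = grams_start / MW[start]
    moles_product = grams_product / MW[product]
    return 100.0 * moles_product / moles_start

# A: nitration, 50 g ketone -> ~53 g nitro ketone
step_a = percent_yield("3,4-methylenedioxyacetophenone", 50,
                       "4,5-methylenedioxy-2-nitroacetophenone", 53)
# A + B overall: 50 g ketone -> 51 g alcohol
overall_ab = percent_yield("3,4-methylenedioxyacetophenone", 50,
                           "1-(4,5-methylenedioxy-2-nitrophenyl)ethanol", 51)
# C: 50 g alcohol -> 55 g chloroformate
step_c = percent_yield("1-(4,5-methylenedioxy-2-nitrophenyl)ethanol", 50,
                       "MeNPOC-Cl", 55)

for label, value in [("A", step_a), ("A+B overall", overall_ab), ("C", step_c)]:
    print(f"{label}: {value:.1f}%")
```

Under these assumed molecular weights, the computed values fall within about one percentage point of the 84%, 80% overall, and 85% figures given in the procedures.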

D. Synthesis of 5'-MeNPOC-2'-DEOXYNUCLEOSIDE-3'-(N,N-DIISOPROPYL
2-CYANOETHYL PHOSPHORAMIDITES)

(1) 5'-MeNPOC-Nucleosides



[Reaction scheme: Base; pyridine]



Sodium borohydride (10 g; 0.27 mol) was added slowly
to a cold, stirring suspension of 53 g (0.25 mol) of 4,5-methylenedioxy-2-nitroacetophenone
in 400 mL methanol.
The temperature was kept below 10° C. by slow addition of
the NaBH4 and external cooling with an ice bath. Stirring
was continued at ambient temperature for another two hours,
at which time TLC (CH2Cl2) indicated complete conversion
of the ketone. The mixture was poured into one liter of
ice-water and the resulting suspension was neutralized with
ammonium chloride and then extracted three times with 400
mL CH2Cl2 or EtOAc (the product can be collected by
filtration and washed at this point, but it is somewhat soluble
in water and this results in a yield of only ~60%). The
combined organic extracts were washed with brine, then
dried with MgSO4 and evaporated. The crude product was
purified from the main byproduct by dissolving it in a




[Structure: 5'-MeNPOC-protected nucleoside; labels: MeNPOC-O, Base]

Base = THYMIDINE (T); N-4-ISOBUTYRYL
2'-DEOXYCYTIDINE (ibu-dC); N-2-PHENOXYACETYL
2'-DEOXYGUANOSINE (PAC-dG); and N-6-PHENOXYACETYL
2'-DEOXYADENOSINE (PAC-dA)

All four of the 5'-MeNPOC nucleosides were prepared
from the base-protected 2'-deoxynucleosides by the following
procedure. The protected 2'-deoxynucleoside (90
mmole) was dried by co-evaporating twice with 250 mL
anhydrous pyridine. The nucleoside was then dissolved in
300 mL anhydrous pyridine (or 1:1 pyridine/DMF, for the
PAC-dG nucleoside) under argon and cooled to ~2° C. in an
ice bath. A solution of 24.6 g (90 mmole) MeNPOC-Cl in
100 mL dry THF was then added with stirring over 30
minutes. The ice bath was removed, and the solution allowed
to stir overnight at room temperature (TLC: 5-10% MeOH
in CH2Cl2; two diastereomers). After evaporating the solvents
under vacuum, the crude material was taken up in 250
mL ethyl acetate and extracted with saturated aqueous
NaHCO3 and brine. The organic phase was then dried over
Na2SO4, filtered and evaporated to obtain a yellow foam.
The crude products were finally purified by flash chromatography
(9x30 cm silica gel column eluted with a stepped
gradient of 2%-6% MeOH in CH2Cl2). Yields of the purified
diastereomeric mixtures are in the range of 65-75%.

(2) 5'-MeNPOC-2'-DEOXYNUCLEOSIDE-3'-(N,N-DIISOPROPYL
2-CYANOETHYL PHOSPHORAMIDITES)



[Reaction scheme: 5'-MeNPOC nucleoside to its 3'-phosphoramidite; labels: MeNPOC-O, Base, P-OCH2CH2CN]

The four deoxynucleosides were phosphitylated using
either 2-cyanoethyl-N,N-diisopropyl
chlorophosphoramidite, or 2-cyanoethyl-N,N,N',N'-tetraisopropylphosphorodiamidite.
The following is a typical
procedure. Add 16.6 g (17.4 ml; 55 mmole) of
2-cyanoethyl-N,N,N',N'-tetraisopropylphosphorodiamidite
to a solution of 50 mmole 5'-MeNPOC-nucleoside and 4.3
g (25 mmole) diisopropylammonium tetrazolide in 250 mL
dry CH2Cl2 under argon at ambient temperature. Continue
stirring for 4-16 hours (reaction monitored by TLC:
45:45:10 hexane/CH2Cl2/Et3N). Wash the organic phase
with saturated aqueous NaHCO3 and brine, then dry over
Na2SO4, and evaporate to dryness. Purify the crude amidite
by flash chromatography (9x25 cm silica gel column eluted
with hexane/CH2Cl2/TEA = 45:45:10 for A, C, T; or 0:90:10
for G). The yield of purified amidite is about 90%.
II. PREPARATION OF LABELED DNA/ 
HYBRIDIZATION TO ARRAY 
1) PCR 

PCR amplification reactions are typically conducted in a
mixture composed of, per reaction: 1 µl genomic DNA; 10 µl
each primer (10 pmol/µl stocks); 10 µl 10×PCR buffer (100
mM Tris.Cl pH 8.5, 500 mM KCl, 15 mM MgCl2); 10 µl 2
mM dNTPs (made from 100 mM dNTP stocks); 2.5 U Taq
polymerase (Perkin Elmer AmpliTaq™, 5 U/µl); and H2O to
100 µl. The cycling conditions are usually 40 cycles (94° C.
45 sec, 55° C. 30 sec, 72° C. 60 sec) but may need to be
varied considerably from sample type to sample type. These
conditions are for 0.2 mL thin wall tubes in a Perkin Elmer
9600 thermocycler. See Perkin Elmer 1992/93 catalogue for
9600 cycle time information. Target, primer length and
sequence composition, among other factors, may also affect
parameters.



For products in the 200 to 1000 bp size range, check 2 µl
of the reaction on a 1.5% 0.5×TBE agarose gel using an
appropriate size standard (phiX174 cut with HaeIII is
convenient). The PCR reaction should yield several picomoles
of product. It is helpful to include a negative control
(i.e., 1 µl TE instead of genomic DNA) to check for possible
contamination. To avoid contamination, keep PCR products
from previous experiments away from later reactions, using
filter tips as appropriate. Using a set of working solutions
and storing master solutions separately is helpful so long as
one does not contaminate the master stock solutions.

For simple amplifications of short fragments from 
genomic DNA it is, in general, unnecessary to optimize 
Mg2+ concentrations. A good procedure is the following:
make a master mix minus enzyme; dispense the genomic 
DNA samples to individual tubes or reaction wells; add 
enzyme to the master mix; and mix and dispense the master 
solution to each well, using a new filter tip each time. 
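The master-mix procedure lends itself to a small volume calculator. The sketch below takes the per-reaction volumes from the recipe given earlier in this section (the 0.5 µl of enzyme follows from 2.5 U at 5 U/µl); the 10% pipetting overage, the forward/reverse primer labels, and the function names are assumptions added for illustration.

```python
# Per-reaction volumes in microliters, from the 100 ul recipe in this section.
PER_REACTION_UL = {
    "forward primer (10 pmol/ul)": 10.0,
    "reverse primer (10 pmol/ul)": 10.0,
    "10x PCR buffer": 10.0,
    "2 mM dNTPs": 10.0,
    "Taq polymerase (5 U/ul)": 0.5,  # 2.5 U at 5 U/ul; added to the mix last
}
GENOMIC_DNA_UL = 1.0   # dispensed to each tube or well separately
REACTION_UL = 100.0

def master_mix(n_reactions, overage=0.10):
    """Return component volumes (ul) for n reactions plus an assumed overage."""
    water = REACTION_UL - GENOMIC_DNA_UL - sum(PER_REACTION_UL.values())
    scale = n_reactions * (1.0 + overage)
    mix = {name: vol * scale for name, vol in PER_REACTION_UL.items()}
    mix["H2O"] = water * scale
    return mix

def dispense_volume():
    """Volume of master mix (ul) to add to each tube already holding the DNA."""
    return REACTION_UL - GENOMIC_DNA_UL

for name, vol in master_mix(12).items():
    print(f"{name}: {vol:.1f} ul")
print("dispense per well:", dispense_volume(), "ul")
```

Keeping the genomic DNA out of the mix mirrors the order of operations described above, in which samples are dispensed first and the enzyme-containing mix is added afterward.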

2) PURIFICATION

Removal of unincorporated nucleotides and primers from
PCR samples can be accomplished using the Promega
Magic PCR Preps DNA purification kit. One can purify the
whole sample, following the instructions supplied with the
kit (proceed from section IIIB, 'Sample preparation for
direct purification from PCR reactions'). After elution of the
PCR product in 50 µl of TE or H2O, one centrifuges the
eluate for 20 sec at 12,000 rpm in a microfuge and carefully
transfers 45 µl to a new microfuge tube, avoiding any visible
pellet. Resin is sometimes carried over during the elution
step. This transfer prevents accidental contamination of the
linear amplification reaction with 'Magic PCR' resin. Other
methods, e.g., size exclusion chromatography, may also be
used.

3) LINEAR AMPLIFICATION

In a 0.2 mL thin-wall PCR tube mix: 4 µl purified PCR
product; 2 µl primer (10 pmol/µl); 4 µl 10×PCR buffer; 4 µl
dNTPs (2 mM dA, dC, dG, 0.1 mM dT); 4 µl 0.1 mM dUTP;
1 µl 1 mM fluorescein dUTP (Amersham RPN 2121); 1 U
Taq polymerase (Perkin Elmer, 5 U/µl); and add H2O to 40
µl. Conduct 40 cycles (92° C. 30 sec, 55° C. 30 sec, 72° C.
90 sec) of PCR. These conditions have been used to amplify
a 300 nucleotide mitochondrial DNA fragment but are
generally applicable. Even in the absence of a visible
product band on an agarose gel, there should still be enough
product to give an easily detectable hybridization signal. If
one is not treating the DNA with uracil DNA glycosylase
(see Section 4), dUTP can be omitted from the reaction.
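The final nucleotide concentrations in the 40 µl labeling reaction follow from simple dilution (stock concentration times volume added, divided by total volume). The sketch below applies that arithmetic to the stock concentrations and volumes listed above; the dictionary layout and the name `final_uM` are illustrative assumptions.

```python
REACTION_UL = 40.0

# name: (stock concentration in mM, volume added in ul), from the recipe above
COMPONENTS = {
    "dATP": (2.0, 4.0),
    "dCTP": (2.0, 4.0),
    "dGTP": (2.0, 4.0),
    "dTTP": (0.1, 4.0),
    "dUTP": (0.1, 4.0),
    "fluorescein-dUTP": (1.0, 1.0),
}

def final_uM(stock_mM, volume_ul, total_ul=REACTION_UL):
    """Final concentration in micromolar after dilution into the reaction."""
    return stock_mM * 1000.0 * volume_ul / total_ul

final = {name: final_uM(*spec) for name, spec in COMPONENTS.items()}
for name, conc in final.items():
    print(f"{name}: {conc:.0f} uM")
```

On these figures dA, dC, and dG end up at 200 µM each, dT and dUTP at 10 µM each, and fluorescein dUTP at 25 µM, so the T-position nucleotides are deliberately limiting relative to the other three.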

4) FRAGMENTATION 

Purify the linear amplification product using the Promega
Magic PCR Preps DNA purification kit, as per Section 2
above. In a 0.2 mL thin-wall PCR tube mix: 40 µl purified
labeled DNA; 4 µl 10×PCR buffer; and 0.5 µl uracil DNA
glycosylase (BRL 1 U/µl). Incubate the mixture 15 min at
37° C., then 10 min at 97° C.; store at -20° C. until ready
to use.

5) HYBRIDIZATION, SCANNING, & STRIPPING

A blank scan of the slide in hybridization buffer only is
helpful to check that the slide is ready for use. The buffer is
removed from the flow cell and replaced with 1 mL of
(fragmented) DNA in hybridization buffer and mixed well.
The scan is performed in the presence of the labeled target.
FIG. 33 shows an illustrative detection system for scanning
a DNA chip. A series of scans at 30 min intervals using
a hybridization temperature of 25° C. yields a very clear
signal, usually within 30 min to two hours, but it may be
desirable to hybridize longer, i.e., overnight. Using a laser
power of 50 µW and 50 µm pixels, one should obtain
maximum counts in the range of hundreds to low thousands/
pixel for a new slide. When finished, the slide can be
stripped using 50% formamide, rinsing well in deionized
H2O, blowing dry, and storing at room temperature.

III. PREPARATION OF LABELED RNA/
HYBRIDIZATION TO ARRAY

1) TAGGED PRIMERS

The primers used to amplify the target nucleic acid should
have promoter sequences if one desires to produce RNA
from the amplified nucleic acid. Suitable promoter
sequences are shown below and include:

(1) the T3 promoter sequence:
5'-CGGAATTAACCCTCACTAAAGG (SEQ. ID NO:298)
5'-AATTAACCCTCACTAAAGGGAG; (SEQ. ID NO:299)
(2) the T7 promoter sequence:
5'-TAATACGACTCACTATAGGGAG; (SEQ. ID NO:300)
and (3) the SP6 promoter sequence:
5'-ATTTAGGTGACACTATAGAA (SEQ. ID NO:301).

The desired promoter sequence is added to the 5' end of the
PCR primer. It is convenient to add a different promoter to
each primer of a PCR primer pair so that either strand may
be transcribed from a single PCR product.

Synthesize PCR primers so as to leave the DMT group on.
DMT-on purification is unnecessary for PCR but appears to
be important for transcription. Add 25 µl 0.5M NaOH to
collection vial prior to collection of oligonucleotide to keep
[illegible] HPLC purification is accomplished by drying down the
oligonucleotides, resuspending in 1 mL 0.1M TEAA (dilute
[illegible] stock in deionized water, filter through 0.2 micron
filter) and filter through 0.2 micron filter. Load 0.5 mL on
reverse phase HPLC (column can be a Hamilton PRP-1
semi-prep, [illegible]9426). The gradient is 0-50% CH3CN over
[illegible] (program 0.2 [illegible].prep.0-50, 25 min). Pool the
desired fractions, dry down, resuspend in 200 µl 80% HAc.
30 min RT. Add 200 µl EtOH; dry down. Resuspend in 200
µl H2O, plus 20 µl NaAc pH 5.5, [illegible] µl EtOH. Leave 10 min
on ice; centrifuge 12,000 rpm [illegible]
off supernatant. Rinse pellet with 1 mL EtOH, dry, resuspend
in 200 µl H2O. Dry, resuspend in 200 µl TE. Measure A260,
prepare a [illegible] pmol/µl [illegible] in TE (10 mM Tris.Cl pH 8.0,
0.1 mM EDTA). Following HPLC purification of a 42 mer,
a yield in the vicinity of 15 nmol from a 0.2 µmol scale
[illegible]

2) GENOMIC DNA PREPARATION

[illegible] hair, one can
extract as few as 5 hairs, including hair roots. On a clean and
sterile surface, one places the hair on a piece of parafilm, and
after wiping a new razor blade with EtOH and cutting off the
roots, the roots are transferred to a 1.5 mL microfuge tube
using a pair of Millipore forceps cleaned with EtOH. Add
500 µl (10 mM Tris.Cl pH 8.0, 10 mM EDTA, 100 mM NaCl,
2% (w/v) SDS, 40 mM DTT, filter sterilized) to the sample.
Add 1.25 µl 20 mg/ml proteinase K (Boehringer). Incubate at
55° C. for 2 hours, vortexing once or twice. Perform 2×0.5
mL [illegible] CHCl3 extractions. After each extraction,
[illegible] min in a microfuge and recover 0.4
mL supernatant. Add 35 µl NaAc pH 5.2 plus 1 mL EtOH.
Place sample on ice 45 min; then centrifuge 12,000 rpm 30
min, rinse; air dry 30 min, and resuspend in 100 µl TE.

3) PCR

PCR is performed in a mixture containing, per reaction: 1
µl genomic DNA; [illegible] each primer (10 pmol/µl stocks); 4 µl
10×PCR buffer (100 mM Tris.Cl pH 8.5, 500 mM KCl, 15
mM MgCl2); [illegible]
stocks); 1 U Taq polymerase (Perkin Elmer, 5 U/µl); H2O to
40 µl. About 40 cycles (94° C. 30 sec, 55° C. 30 sec, 72° C.
30 sec) are performed but cycling conditions may need to
be varied. These conditions are for 0.2 mL thin wall tubes in
a Perkin Elmer 9600. For products in the 200 to 1000 bp size
range, check 2 µl of the reaction on a 1.5% 0.5×TBE agarose
gel using an appropriate size standard. For larger or smaller
volumes (20-100 µl), one can use the same amount of
genomic DNA [illegible] accordingly.

4) IN VITRO TRANSCRIPTION

Mix: 3 µl PCR product; 4 µl 5×buffer; 2 µl DTT; 10
mM rNTPs (100 mM solutions from Pharmacia); 0.48 µl 10
mM fluorescein-UTP (Fluorescein-12-UTP, 10 mM
solution, from Boehringer Mannheim); 0.5 µl RNA polymerase
(Promega T3 or T7 RNA polymerase); and add H2O
to 20 µl. Incubate at 37° C. for 3 h. Check 2 µl of the reaction
on a 1.5% 0.5×TBE agarose gel [illegible].
5×buffer is 200 mM Tris pH 7.5, 30 mM MgCl2, 10 mM
spermidine, 50 mM NaCl, and 100 mM DTT (supplied with
enzyme). The PCR product needs no purification and can be
added directly to the transcription mixture. A [illegible] µl reaction
is suggested for an [illegible];
a 100 µl reaction is considered preparative scale (the
reaction can be scaled up to obtain more target). The amount
of PCR product to add [illegible]
will yield several picomoles of DNA. If the PCR reaction
does not produce that much target, then one should increase
[illegible]
UTP suggested above is 1:5, but ratios from 1:3 to 1:10 all
work well. One can also label with biotin-UTP and detect
with streptavidin-FITC to obtain similar results as with
fluorescein-UTP detection.

For nondenaturing agarose gel electrophoresis of RNA,
note that the RNA band will normally migrate somewhat
faster than the DNA template band, although sometimes the
two bands will comigrate. The temperature of the gel can
affect the migration of the RNA band. The RNA produced
from in vitro transcription is quite stable and can be stored
for months (at least) at -20° C. [illegible]
degradation. [illegible] can be stored in [illegible]
Triton X-100 at -20° C. for days (at least) and reused twice
(at least) for hybridization without taking any special precautions
in preparation [illegible]
should of course be avoided. When extracting RNA from
cells, it is preferable to work very rapidly and to use strongly
denaturing conditions. Avoid using glassware previously
contaminated with RNases. Use of [illegible] plasticware
(not necessarily sterilized) [illegible]
plastic tubes, tips, etc. are [illegible]
with DEPC or autoclaving is typically not necessary.

5) FRAGMENTATION

In a 0.2 mL thin-wall PCR tube mix: 18 [illegible]
H2O; and 4 µl 1M Tris.Cl pH 9.0. Incubate at [illegible]° C. [illegible]
min. Add to 1 mL hybridization buffer and store at -20° C.
until ready to use. The alkaline hydrolysis step is very
reliable. The hydrolysed target can be stored at -20° C. in
6×SSPE/0.1% Triton X-100 for at least several days prior to
use [illegible]

6) HYBRIDIZATION, SCANNING, & STRIPPING

A blank scan of the slide in hybridization buffer only is
helpful to check that the slide is ready for use. The buffer is
removed from the flow cell and replaced with 1 mL of
(hydrolysed) RNA in hybridization buffer and mixed well.
Incubate for 15-30 min at 18° C. Remove the hybridization
solution, which can be saved [illegible].
Rinse the flow cell 4-5 times with fresh [illegible]
0.1% Triton X-100, equilibrated to 18° C. The rinses can be
performed rapidly, but it is important to empty the flow cell
before each new rinse and to mix the liquid in the cell
thoroughly. The scan is performed in the presence of the
labeled target. A series of scans at 30 min intervals using a
hybridization temperature of 25° C. yields a very clear
signal, usually within 30 min to two hours, but it may be
desirable to hybridize longer, i.e., overnight. Using a laser
power of 50 µW and 50 µm pixels, one should obtain
maximum counts in the range of hundreds to low thousands/
pixel for a new slide. When finished, the slide can be
stripped using 50% to 100% formamide at 50° C. for 30 min,






rinsing well in deionized H2O, blowing dry, and storing at
room temperature.

These conditions are illustrative and assume a probe
length of ~15 nucleotides. The stripping conditions suggested
are fairly severe, but some signal may remain on the
slide if the washing is not stringent. Nevertheless, the counts
remaining after the wash should be very low in comparison
to the signal in presence of target RNA. In some cases, much
gentler stripping conditions are effective. The lower the
hybridization temperature and the longer the duration of
hybridization, the more difficult it is to strip the slide. Longer
targets may be more difficult to strip than shorter targets.



SEQUENCE LISTING 



( 1 ) GENERAL INFORMATION:

( i i i ) NUMBER OF SEQUENCES: 360



( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

TTCCTGACGT CAGCC 15

( 2 ) INFORMATION FOR SEQ ID NO:2:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

TTGCTGACAT CAGCC 15



( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:

TTGCTGACCT CAGCC



( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:

TTGCTCACTT CAGCC 15



( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 39 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5:

CATTAAAGAA AATATCATCT TTGGTGTTTC CTATGATGA 39



( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 36 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CATTAAAGAA AATATCATTG GTGTTTCCTA TGATGA 36

( 2 ) INFORMATION FOR SEQ ID NO:7:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 36 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (genomic)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:

CATTAAAGAA AATATCATTG GTGTTTCCTA TGATGA 36



( 2 ) INFORMATION FOR SEQ ID NO:8:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8:

AACACCAATG ATGAT 15



( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

CCAAAGATNA TATTT 15



( 2 ) INFORMATION FOR SEQ ID NO:10:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:

ACCAAAGANG ATATT



( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:

CACCAAAGNT GATAT



( 2 ) INFORMATION FOR SEQ ID NO:12:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:12:

ACACCAAANA TGATA

( 2 ) INFORMATION FOR SEQ ID NO:13:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:13:

AACACCAANG ATGAT



( 2 ) INFORMATION FOR SEQ ID NO:14:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:14:

AAACACCANA GATGA



( 2 ) INFORMATION FOR SEQ ID NO:15:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:15:

CAAACACCNA ACATC 15



( 2 ) INFORMATION FOR SEQ ID NO:16:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:16:

CCAAACACNA AAGAT 15



( 2 ) INFORMATION FOR SEQ ID NO:17:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:17:

AGGAAACANC AAAGA 15



( 2 ) INFORMATION FOR SEQ ID NO:18:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 21 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:18:

CCTTCAGAGG GTAAAATTAA G 21



( 2 ) INFORMATION FOR SEQ ID NO:19:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 21 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:19:

CCTTCAGAGT GTAAAATTAA G



21



( 2 ) INFORMATION FOR SEQ ID NO:20:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 44 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:20:
TAATACGACT CACTATAGGG AGATCACCTA ATAATGATGG GTTT



44



( 2 ) INFORMATION FOR SEQ ID NO:21:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 43 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:21:

TAATACGACT CACTATAGGG AGTAGTGTGA AGGGTTCATA TGC 



43



( 2 ) INFORMATION FOR SEQ ID NO:22:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 45 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:22:

CTCGGAATTA ACCCTCACTA AAGGTAGTGT GAAGGGTTCA TATGC



45



( 2 ) INFORMATION FOR SEQ ID NO:23:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 43 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:23:

TAATACGACT CACTATAGGG AGACCATACT AAAAGTGACT CTC 



43



( 2 ) INFORMATION FOR SEQ ID NO:24:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 44 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:24:

TAATACGACT CACTATAGGG AGACATGAAT GACATTTACA CCAA



44



( 2 ) INFORMATION FOR SEQ ID NO:25:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 44 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:25:

CGGAATTAAC CCTCACTAAA GGACATGAAT GACATTTACA CCAA



44



( 2 ) INFORMATION FOR SEQ ID NO:26:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:26:

TTTATGGGGT GA



12



( 2 ) INFORMATION FOR SEQ ID NO:27: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:27:

TTGATTTATG GG 



12



( 2 ) INFORMATION FOR SEQ ID NO:28: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:28:

AACCTATTTG ATT



13



( 2 ) INFORMATION FOR SEQ ID NO:29: 



( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:29:

GGACCAAACC TA



12



( 2 ) INFORMATION FOR SEQ ID NO:30: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 



( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:30:

AGGCTAGGAC CA



( 2 ) INFORMATION FOR SEQ ID NO:31:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:31:

CCTCTCTGTG TGC



( 2 ) INFORMATION FOR SEQ ID NO:32: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:32:

CCGTGTGTGT GTGC



( 2 ) INFORMATION FOR SEQ ID NO:33: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:33:

GCTGTGTGTG TGCT 



( 2 ) INFORMATION FOR SEQ ID NO:34: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:34:

CTGGGTAGGA TG 



( 2 ) INFORMATION FOR SEQ ID NO:35:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:35:

TGCTGGGTAG GA



( 2 ) INFORMATION FOR SEQ ID NO:36:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:36:
TGTCCTCCGT AG



12



( 2 ) INFORMATION FOR SEQ ID NO:37:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:37:



CTTACCAGCG GT 



12



( 2 ) INFORMATION FOR SEQ ID NO:38:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:38:



CGGTTAGCAG CG 



12



( 2 ) INFORMATION FOR SEQ ID NO:39:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: U base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:39:



( 2 ) INFORMATION FOR SEQ ID NO:40: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:40: 

AGCGGGGGAG 10

( 2 ) INFORMATION FOR SEQ ID NO:41:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:41:

GGTTGGTTCG G 11



( 2 ) INFORMATION FOR SEQ ID NO:42:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:42:

GGGTTTGGTT GG



12



( 2 ) INFORMATION FOR SEQ ID NO:43:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:43:

GATCTTTGGG GT 



12



( 2 ) INFORMATION FOR SEQ ID NO:44: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:44:

GGGTGATCTT TG 



12



( 2 ) INFORMATION FOR SEQ ID NO:45: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:45: 

TGTGGGGGGT GA 



12



( 2 ) INFORMATION FOR SEQ ID NO:46: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:46:

TAAACTGTGG GG 



12



( 2 ) INFORMATION FOR SEQ ID NO:47: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:47:

GCTACATAAA CTG



( 2 ) INFORMATION FOR SEQ ID NO:48: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:48:

GAGGTAAGCT ACA



( 2 ) INFORMATION FOR SEQ ID NO:49: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:49:

GAGGAGGTAA GC



( 2 ) INFORMATION FOR SEQ ID NO:50: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:50:

TGCTTTGAGG AG 



( 2 ) INFORMATION FOR SEQ ID NO:51: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:51:

AGTGTATTGC TTT 



( 2 ) INFORMATION FOR SEQ ID NO:52: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:52:

CATTTTCAGT GTA



( 2 ) INFORMATION FOR SEQ ID NO:53: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 



( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:53:



TAAACATTTT CAG 



( 2 ) INFORMATION FOR SEQ ID NO:54: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:54:



AGCCCGTCTA AA 



( 2 ) INFORMATION FOR SEQ ID NO:55: 



( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:55: 

GAGCCCGTCT AA 



( 2 ) INFORMATION FOR SEQ ID NO:56: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:56:

TGATGTGAGC CC

( 2 ) INFORMATION FOR SEQ ID NO:57:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:57: 



GGGGTGATGT GA



( 2 ) INFORMATION FOR SEQ ID NO:58:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:58:

CACTGGGACG G 



( 2 ) INFORMATION FOR SEQ ID NO:59:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:59: 

GTATGGGAGT GG 



( 2 ) INFORMATION FOR SEQ ID NO:60: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:60:

GATTAGTAGT ATGG 



( 2 ) INFORMATION FOR SEQ ID NO:61:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:61:

TGAATGAGAT TAG 



( 2 ) INFORMATION FOR SEQ ID NO:62: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:62:

ATTGAATGAG ATT 



( 2 ) INFORMATION FOR SEQ ID NO:63:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:63:
CCGTTGTATT CAA



13



( 2 ) INFORMATION FOR SEQ ID NO:64:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:64: 

GCGGGGGTTG 



10



( 2 ) INFORMATION FOR SEQ ID NO:65: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:65: 

ATGGGCGGGG 



10



( 2 ) INFORMATION FOR SEQ ID NO:66:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:66:

TAGGATGGGC G 11



( 2 ) INFORMATION FOR SEQ ID NO:67:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:67:

TGGGTAGGAT GG 



12



( 2 ) INFORMATION FOR SEQ ID NO:68:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:68:

CTGCTGCCTA GG



12



( 2 ) INFORMATION FOR SEQ ID NO:69: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:69:

TCTGTGTGCT GG 



12



( 2 ) INFORMATION FOR SEQ ID NO:70: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:70: 

GCGGTGTGTG TG 



12



( 2 ) INFORMATION FOR SEQ ID NO:71:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:71:

TAGCAGCGGT GT 

( 2 ) INFORMATION FOR SEQ ID NO:72: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:72:

TGGGGTTAGC AG 



12



12



( 2 ) INFORMATION FOR SEQ ID NO:73: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:73:

GGTATGGGGT TA



12



( 2 ) INFORMATION FOR SEQ ID NO:74:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:74: 

GTTCGGGGTA TG 



( 2 ) INFORMATION FOR SEQ ID NO:75: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:75:

GCTGGTGTTA GG 



( 2 ) INFORMATION FOR SEQ ID NO:76: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:76:

GGTTAGGCTC GT 



( 2 ) INFORMATION FOR SEQ ID NO:77: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:77:

AAATCTGGTT AGG



( 2 ) INFORMATION FOR SEQ ID NO:78: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:78:

AAATTTGAAA TCT 



( 2 ) INFORMATION FOR SEQ ID NO:79:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:79:
AAGATAAAAT TTG 



( 2 ) INFORMATION FOR SEQ ID NO:80:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:80:

GCCAAAAACA TA 



( 2 ) INFORMATION FOR SEQ ID NO:81:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:81:

CCCCAAAAAG A 



( 2 ) INFORMATION FOR SEQ ID NO:82:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:82: 

CATACCGCCA A 



( 2 ) INFORMATION FOR SEQ ID NO:83:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:83:

AAAAGTGCAT ACC 



( 2 ) INFORMATION FOR SEQ ID NO:84:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:84:

TGTTAAAACT CCA



13



( 2 ) INFORMATION FOR SEQ ID NO:85:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:85:



GGGTGACTCT TAA



13



( 2 ) INFORMATION FOR SEQ ID NO:86: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:86:



GGGGGTGACT GT 



12



( 2 ) INFORMATION FOR SEQ ID NO:87: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:87:



AGTTGGGGGG T 



11



( 2 ) INFORMATION FOR SEQ ID NO:88:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:88:



TGTGTTAGTT GGG 



13



( 2 ) INFORMATION FOR SEQ ID NO:89:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:89:

AAAATAATGT GTT



13



( 2 ) INFORMATION FOR SEQ ID NO:90:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:90:

AGGGGAAAAT AA



12



( 2 ) INFORMATION FOR SEQ ID NO:91: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:91:



GGAGGGGAAA AT 



12



( 2 ) INFORMATION FOR SEQ ID NO:92: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:92:



GGAAATTTTT TG



12



( 2 ) INFORMATION FOR SEQ ID NO:93: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:93:

CGTGGAAATT TT 



12



( 2 ) INFORMATION FOR SEQ ID NO:94:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:94:



GGTTTGGTGG A 



11



( 2 ) INFORMATION FOR SEQ ID NO:95: 



( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:95:



GAGGGGGGGT T 



11



( 2 ) INFORMATION FOR SEQ ID NO:96:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:96:

CCGCGGGACC 



10



( 2 ) INFORMATION FOR SEQ ID NO:97: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:97:

CAGAAGCGGG G 



11



( 2 ) INFORMATION FOR SEQ ID NO:98:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:98:

GTAGGCCAGA AG



12



( 2 ) INFORMATION FOR SEQ ID NO:99: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:99:

GTGCTGTAGG CC 



12



( 2 ) INFORMATION FOR SEQ ID NO:100:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:100:

TGTTTAAGTC CTC



13



( 2 ) INFORMATION FOR SEQ ID NO:101:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:101:

TGTGTTTAAC TGC 



13



( 2 ) INFORMATION FOR SEQ ID NO:102:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:102:

GCAGAGATGT GTT 



13



( 2 ) INFORMATION FOR SEQ ID NO:103: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:103:

TTTGGCAGAG AT 



12



( 2 ) INFORMATION FOR SEQ ID NO:104:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:104:

GGCCTTTGGC A



11



( 2 ) INFORMATION FOR SEQ ID NO:105:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:105:

TGTTTTTGGG GT 



12



( 2 ) INFORMATION FOR SEQ ID NO:106:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:106:

TTTGTTTTTG GG



12



( 2 ) INFORMATION FOR SEQ ID NO:107:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:107: 

GCCTTCTTTG TT



12



( 2 ) INFORMATION FOR SEQ ID NO:108:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:108:

GTGTTAGGGT TCT 



13



( 2 ) INFORMATION FOR SEQ ID NO:109: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:109:

TTTAGTAAGT ATGT 



14



( 2 ) INFORMATION FOR SEQ ID NO:110:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:110:

AACACACTTT AGT



13



( 2 ) INFORMATION FOR SEQ ID NO:111:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:111:
AATTAATTAA CACA 

( 2 ) INFORMATION FOR SEQ ID NO:112:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:112:

AAGCATTAAT TAA



( 2 ) INFORMATION FOR SEQ ID NO:113:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:113:

GTCCTACAAG CAT 



13



( 2 ) INFORMATION FOR SEQ ID NO:114:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:114:

TGTCCTACAA GCA 



13



( 2 ) INFORMATION FOR SEQ ID NO:115:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:115:

ATTATTATGT CCT



13



( 2 ) INFORMATION FOR SEQ ID NO:116:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:116:

TTCTTATTAT TATG



( 2 ) INFORMATION FOR SEQ ID NO:117:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:117:

ATTCAAATTG TTA 



( 2 ) INFORMATION FOR SEQ ID NO:118: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:118:

GCAGACATTC AAA 



( 2 ) INFORMATION FOR SEQ ID NO:119:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:119: 

GCTGTGCAGA CA 



( 2 ) INFORMATION FOR SEQ ID NO:120:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:120:

AAAGTGGCTG TG 



( 2 ) INFORMATION FOR SEQ ID NO:121:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:121:

TGTGTGGAAA GTG



( 2 ) INFORMATION FOR SEQ ID NO:122:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:122: 

GATGTCTCTG TGG



( 2 ) INFORMATION FOR SEQ ID NO:123: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:123:

ATGATGTCTG TGT



( 2 ) INFORMATION FOR SEQ ID NO:124:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:124:

TTTTGTTATG ATG 



( 2 ) INFORMATION FOR SEQ ID NO:125: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:125:

TTTTTTGTTA TGA



( 2 ) INFORMATION FOR SEQ ID NO:126:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:126:

ATAGGGTGCT CC 



( 2 ) INFORMATION FOR SEQ ID NO:127: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pain 
( B ) TYPE nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 
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( I i ) MOLECULE TYPE: DNA (probe) 
( z 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:127: 
GCCACATAGG GT



( 2 ) INFORMATION FOR SEQ ID NO:128:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:128:

TACTGCGACA TAG 



( 2 ) INFORMATION FOR SEQ ID NO: 129: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:129:

GACAGATACT GCG 



( 2 ) INFORMATION FOR SEQ ID NO:130:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:130:

AATCAAAGAC AGA 



( 2 ) INFORMATION FOR SEQ ID NO:131:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:131: 

AGGAATCAAA GAC



( 2 ) INFORMATION FOR SEQ ID NO:132:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:132:

TCACCCACGA AT



( 2 ) INFORMATION FOR SEQ ID NO:133:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:133:

AGGATGAGGC AG



( 2 ) INFORMATION FOR SEQ ID NO:134:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:134: 

AAATAATAGG ATG 



( 2 ) INFORMATION FOR SEQ ID NO:135:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:135:

GCGATAAATA AT 



( 2 ) INFORMATION FOR SEQ ID NO:136:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:136:

TACGATGCGA TA 



( 2 ) INFORMATION FOR SEQ ID NO:137:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:137:

GTAGCATGCG AT 



( 2 ) INFORMATION FOR SEQ ID NO:138:







( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:138:

TTGAACGTAC GA 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO: 139: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:139:

AATATTGAAC GTA



1 3 



( 2 ) INFORMATION FOR SEQ ID NO:140:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( 1 I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:140:

GCCTGTAATA TTG 



1 3 



( 2 ) INFORMATION FOR SEQ ID NO:141:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( 1 I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:141:

TGTTCGCCTG TA 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:142:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:142:

OTATCTTCGC CT 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:143:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:143:
CTCCCGTGAG TG



( 2 ) INFORMATION FOR SEQ ID NO:144:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:144:

GAGAGCTCCC GT



( 2 ) INFORMATION FOR SEQ ID NO:145:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:145:

ATGGAGAGCT CC



( 2 ) INFORMATION FOR SEQ ID NO:146:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:146:

AATGCATGGA GA



( 2 ) INFORMATION FOR SEQ ID NO:147:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:147:

ATACCAAATG CA



( 2 ) INFORMATION FOR SEQ ID NO:148:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:148:

CACGAAAATA CCA



1 3 



( 2 ) INFORMATION FOR SEQ ID NO:149:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:149:

CCCAGACGAA A



1 1 



( 2 ) INFORMATION FOR SEQ ID NO:150:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( 1 I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO: 150: 

TACCCCCCAG A 



1 1 



( 2 ) INFORMATION FOR SEQ ID NO:151:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:151:

TCCATACCCC C



1 1 



( 2 ) INFORMATION FOR SEQ ID NO:152:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:152:

TCGCGTGCAT AC 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO: 153: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:153:

GACTATCGCG TG 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:154:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:154:

ATCACTATCC CG 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:155:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:155:

CTCGCAATGA CT 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:156:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:156:

CGTCTCGCAA TG 12



( 2 ) INFORMATION FOR SEQ ID NO:157:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:157:

CTCCAGCGTC TC 12



( 2 ) INFORMATION FOR SEQ ID NO:158:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:158: 

TCCGGCTCCA C 



( 2 ) INFORMATION FOR SEQ ID NO:159:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 






( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:159:
OTOCTCCCOC T 



( 2 ) INFORMATION FOR SEQ ID NO:160:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:160:

GACCCTGAAG TAG 



( 2 ) INFORMATION FOR SEQ ID NO:161:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:161:

TTTATCACCC TGA



( 2 ) INFORMATION FOR SEQ ID NO:162:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:162: 

TTTAGGCTTT ATG 



( 2 ) INFORMATION FOR SEQ ID NO:163: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:163:

CCTATTTAGG CT 



( 2 ) INFORMATION FOR SEQ ID NO:164:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:164:

TCGGCTATTT AC



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:165:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:165:

ACCTGTGGGC TA 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:166:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:166:

AGGGGAACGT GT



1 2 



( 2 ) INFORMATION FOR SEQ ID NO: 167: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:167:



TTTAAGGGGA AC 



1 2 



( 2 ) INFORMATION FOR SEQ ID NO:168:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 14 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:168:



ATGTCTTATT TAAG 



1 4 



( 2 ) INFORMATION FOR SEQ ID NO:169:

( i ) SEQUENCE CHARACTERISTICS:

( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:169:



CATCGTGATG TCT 



1 3 



( 2 ) INFORMATION FOR SEQ ID NO:170:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:170: 

TCCATCCTGA TG 



( 2 ) INFORMATION FOR SEQ ID NO:171:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:171:

GATGATCCAT CG



( 2 ) INFORMATION FOR SEQ ID NO:172: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:172:

AGACCTGATG ATC 



( 2 ) INFORMATION FOR SEQ ID NO:173:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:173:

GGCTGATACA CCT 



( 2 ) INFORMATION FOR SEQ ID NO:174:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:174:

ATAGGGTGAT AGA 



( 2 ) INFORMATION FOR SEQ ID NO:175: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:175:
TGGTTAATAG CG 12

( 2 ) INFORMATION FOR SEQ ID NO:176:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:176:

GTGAGTGGTT AAT 13

( 2 ) INFORMATION FOR SEQ ID NO:177:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:177:

TGTGCGGGAT AT 12 

( 2 ) INFORMATION FOR SEQ ID NO:178:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:178:

ACTCTTGTGC GG 12 

( 2 ) INFORMATION FOR SEQ ID NO:179: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:179: 

TACCACTCTT GTG 13 

( 2 ) INFORMATION FOR SEQ ID NO: 180: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:180:

CGACAGTAGC ACT

( 2 ) INFORMATION FOR SEQ ID NO:181:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:181:

CCCACCAGAG TA 

( 2 ) INFORMATION FOR SEQ ID NO:182:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:182:

CCGAGCGAGG A 

( 2 ) INFORMATION FOR SEQ ID NO:183:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 10 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:183:

GGCCCGGAGC 

( 2 ) INFORMATION FOR SEQ ID NO: 184: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:184:

TTATGGGCCC G 

( 2 ) INFORMATION FOR SEQ ID NO:185:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:185:

AGTGTTATGG GC 






( 2 ) INFORMATION FOR SEQ ID NO:186:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i 1 ) MOLECULE TYPE: DNA (probe) 

( x 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:186: 

TACCCCCAAG TG 



( 2 ) INFORMATION FOR SEQ ID NO:187:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I 1 ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:187: 

TTTAGCTACC CC 



( 2 ) INFORMATION FOR SEQ ID NO:188: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:188:

TTCACTTTAG CTA



( 2 ) INFORMATION FOR SEQ ID NO:189: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( 1 I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:189:

TACAGTTCAC TTT



( 2 ) INFORMATION FOR SEQ ID NO:190:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:190:

TCGAGATACA GTT 



( 2 ) INFORMATION FOR SEQ ID NO:191:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:191:
CAGATCTCGA GAT



( 2 ) INFORMATION FOR SEQ ID NO:192:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:192:

AGGAACCAGA TG 



( 2 ) INFORMATION FOR SEQ ID NO:193:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:193: 

GAAGTAGGAA CCA 



( 2 ) INFORMATION FOR SEQ ID NO:194: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:194: 

GACTGTAATG TGC 



( 2 ) INFORMATION FOR SEQ ID NO:195:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:195: 

GGOATTTGAC TG T 



( 2 ) INFORMATION FOR SEQ ID NO:196: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 



( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:196:




( 2 ) INFORMATION FOR SEQ ID NO:197: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:197:

ACGAGAAGGG AT 

1 2 

( 2 ) INFORMATION FOR SEQ ID NO:198:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:198:

TGGGCACCAC AA 

1 2 



( 2 ) INFORMATION FOR SEQ ID NO:199:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( 1 I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:199:

ATCCATGGGG AC 



( 2 ) INFORMATION FOR SEQ ID NO:200:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:200: 

GGTCATCCAT GG 



( 2 ) INFORMATION FOR SEQ ID NO:201: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:201:

AGGGGGGTCA T 

1 1 



( 2 ) INFORMATION FOR SEQ ID NO:202:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:202:

TATCTGACGC GG 



( 2 ) INFORMATION FOR SEQ ID NO:203: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:203:

ACCCCTATCT GA 



( 2 ) INFORMATION FOR SEQ ID NO:204: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:204:

AGGGACCCCT A



( 2 ) INFORMATION FOR SEQ ID NO:205:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:205:

TGGTCAAGGG AC 



( 2 ) INFORMATION FOR SEQ ID NO:206: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I 1 ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:206: 

GGATGGTGGT CA 



( 2 ) INFORMATION FOR SEQ ID NO:207: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:207:
ACGATGCTGG TC



( 2 ) INFORMATION FOR SEQ ID NO:208:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:208:

ACACGGAGGA TG



( 2 ) INFORMATION FOR SEQ ID NO:209: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:209: 

TGATTTACAC GG 



( 2 ) INFORMATION FOR SEQ ID NO:210:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:210: 

GGGATATTGA TTT 



( 2 ) INFORMATION FOR SEQ ID NO:211: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:211: 

GTOOCATTTG OA 



( 2 ) INFORMATION FOR SEQ ID N0:212: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear



( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:212:

ACCOOTCGCA T



( 2 ) INFORMATION FOR SEQ ID NO:213:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:213:

GGTGAGGGGT G 



( 2 ) INFORMATION FOR SEQ ID NO:214:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:214: 

AGTCGGTGAG GG 



( 2 ) INFORMATION FOR SEQ ID NO:215: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:215: 

GTATCCTAGT GGG



( 2 ) INFORMATION FOR SEQ ID NO:216:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:216: 

TTTGTTGGTA TCC 



( 2 ) INFORMATION FOR SEQ ID NO:217:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:217:

GTACGTTTGT TGG 



( 2 ) INFORMATION FOR SEQ ID NO:218:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:218:

TOOCTAOGTT TO 



( 2 ) INFORMATION FOR SEQ ID NO:219: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:219:

TAAGGGTGGG TA 



( 2 ) INFORMATION FOR SEQ ID NO:220: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:220: 

GTACTGTTAA GGG 



( 2 ) INFORMATION FOR SEQ ID NO:221:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 14 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:221:

TGTACTATGT ACTG



( 2 ) INFORMATION FOR SEQ ID NO:222: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:222: 

GGCTTTATGT ACT 



( 2 ) INFORMATION FOR SEQ ID NO:223: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:223:
AAATGGCTTT AT



( 2 ) INFORMATION FOR SEQ ID NO:224: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:224: 

GGTAAATGGC TT 



( 2 ) INFORMATION FOR SEQ ID NO:225: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:225: 

TCTACGGTAA ATG



( 2 ) INFORMATION FOR SEQ ID NO:226:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:226:

GTGCTAATGT ACG 



( 2 ) INFORMATION FOR SEQ ID NO:227: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:227: 

TAATGTGCTA ATG 



( 2 ) INFORMATION FOR SEQ ID NO:228:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:228:

CATGGCGACC G



( 2 ) INFORMATION FOR SEQ ID NO:229:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

(II) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:229: 

TGTAAGCATC GG 



( 2 ) INFORMATION FOR SEQ ID NO:230:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:230:

TTGCTTGTAA GCA 



( 2 ) INFORMATION FOR SEQ ID NO:231:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I 1 ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:231: 

TGTACTTGCT TGT 



( 2 ) INFORMATION FOR SEQ ID NO:232: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:232:

TTGCTCTACT TGC 



( 2 ) INFORMATION FOR SEQ ID NO:233: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:233: 

GGTTGATTGC TG 



( 2 ) INFORMATION FOR SEQ ID NO:234:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:234:

TTGAGGGTTG AT



( 2 ) INFORMATION FOR SEQ ID NO:235: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x 1 ) SEQUENCE DESCRIPTION: SEQ ID NO:235: 

GTCATACTTG AGG



( 2 ) INFORMATION FOR SEQ ID NO:236: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:236:

TTGATGTGTG ATA 



( 2 ) INFORMATION FOR SEQ ID NO:237:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D) TOPOLOGY: linear 

( 1 i ) MOLECULE TYPE DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:237: 

TGCAGTTGAT GTG 



( 2 ) INFORMATION FOR SEQ ID NO:238:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:238:

TCGAGTTCCA GT 



( 2 ) INFORMATION FOR SEQ ID NO:239:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:239:

ATTTOCAGTT CC

( 2 ) INFORMATION FOR SEQ ID NO:240:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:240: 

TACCGTACAA TAT 

( 2 ) INFORMATION FOR SEQ ID NO:241:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:241:

TGGTACCGTA CAA

( 2 ) INFORMATION FOR SEQ ID NO:242: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:242:

TATTTATGGT ACC 

( 2 ) INFORMATION FOR SEQ ID NO:243: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:243:

GOTCAAGTAT TTA



( 2 ) INFORMATION FOR SEQ ID NO:244: 

( 1 ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:244:

TACAGGTCCT CAA 13



( 2 ) INFORMATION FOR SEQ ID NO:245: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:245:

ATGTACTACA GGT 13



( 2 ) INFORMATION FOR SEQ ID NO:246:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:246:

GGTTTTTATG TAC 13



( 2 ) INFORMATION FOR SEQ ID NO:247:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:247:

GGATTGGGTT TT 12



( 2 ) INFORMATION FOR SEQ ID NO:248: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:248:

TGTAGGATTG GG 12



( 2 ) INFORMATION FOR SEQ ID NO:249: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:249: 

GTTTTGATGT AGG 13



( 2 ) INFORMATION FOR SEQ ID NO:250:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:250:

GGGTTTTGAT GT 12



( 2 ) INFORMATION FOR SEQ ID NO:251: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 11 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:251:

GGACCCCGTT T 11



( 2 ) INFORMATION FOR SEQ ID NO:252:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:252:

GTCAATACTT GGG 13



( 2 ) INFORMATION FOR SEQ ID NO:253: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:253: 

GGGTGAGTCA ATA 



( 2 ) INFORMATION FOR SEQ ID NO:254: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:254:

TGGGTGAGTC AA 



( 2 ) INFORMATION FOR SEQ ID NO:255:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:255:

TCTTCATCGO TG

( 2 ) INFORMATION FOR SEQ ID NO:256:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:256:

CGGTTGTTGA TO 

( 2 ) INFORMATION FOR SEQ ID NO:257: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:257:

ACATAGCGGT TG

( 2 ) INFORMATION FOR SEQ ID NO:258:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:258:

GAAAATACAT AGC 13



( 2 ) INFORMATION FOR SEQ ID NO:259: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:259:

AATGTACGAA AAT 13



( 2 ) INFORMATION FOR SEQ ID NO:260:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:260:

GCAGTAATGT ACG 13



( 2 ) INFORMATION FOR SEQ ID NO:261:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:261:

TGGCTGGCAG TA 12



( 2 ) INFORMATION FOR SEQ ID NO:262: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:262:

TCATGGTGGC TG 12



( 2 ) INFORMATION FOR SEQ ID NO:263: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:263: 

ACAATATTCA TGG 13



( 2 ) INFORMATION FOR SEQ ID NO:264: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:264:

TAGAATCTTA GCT 13



( 2 ) INFORMATION FOR SEQ ID NO:265:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 13 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:265:

TTTAAATTAG AAT 13



( 2 ) INFORMATION FOR SEQ ID NO:266:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:266: 

GAATAAOTTT AAA 



( 2 ) INFORMATION FOR SEQ ID NO:267:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:267:

GAACAGAGAA TAA



( 2 ) INFORMATION FOR SEQ ID NO:268:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:268:

AAAGAACACA GAA



( 2 ) INFORMATION FOR SEQ ID NO:269:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:269:

CCCATGAAAG AA 



( 2 ) INFORMATION FOR SEQ ID NO:270:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:270:

TTCCCCATGA AA



( 2 ) INFORMATION FOR SEQ ID NO:271:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:271:

ATCTGCTTCC CC



( 2 ) INFORMATION FOR SEQ ID NO:272:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:272: 

CAAATCTGCT TC



( 2 ) INFORMATION FOR SEQ ID NO:273: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:273: 

GGTACCCAAA TC 



( 2 ) INFORMATION FOR SEQ ID NO:274:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:274:

GGTGGTACCC AA



( 2 ) INFORMATION FOR SEQ ID NO:275:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:275:

TACTTGGGTG GT 



( 2 ) INFORMATION FOR SEQ ID NO:276:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:276:

TCCAAAAACC TT 12



( 2 ) INFORMATION FOR SEQ ID NO:277: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:277:

GTCCTTGGAA AA 12



( 2 ) INFORMATION FOR SEQ ID NO:278:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:278:

ATTTGTCCTT GG 12



( 2 ) INFORMATION FOR SEQ ID NO:279:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:279:

CTCTGATTTG TCC 13



( 2 ) INFORMATION FOR SEQ ID NO:280:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:280:

TTTTTCTCTG ATT 13



( 2 ) INFORMATION FOR SEQ ID NO:281:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:281:

TAAAGACTTT TTC 13



( 2 ) INFORMATION FOR SEQ ID NO:282:








( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:282:

GTGGAGTTAA AGA 13



( 2 ) INFORMATION FOR SEQ ID NO:283:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 13 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:283:

TGGTGGAGTT AAA 13



( 2 ) INFORMATION FOR SEQ ID NO:284:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:284:

TGCTAATGGT GG 12



( 2 ) INFORMATION FOR SEQ ID NO:285:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:285:



( 2 ) INFORMATION FOR SEQ ID NO:286: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:286:

TAGCTTTGGG TG 12 



( 2 ) INFORMATION FOR SEQ ID NO:287: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:287:

TCTTAGCTTT GG



( 2 ) INFORMATION FOR SEQ ID NO:288:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 22 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:288: 

CACTTGTGCC CTGACTTTCA AC



( 2 ) INFORMATION FOR SEQ ID NO:289:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 49 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:289: 

ATGCAATTAA CCCTCACTAA ACGGAGACAC TTGTGCCCTG ACTTTCAAC 

( 2 ) INFORMATION FOR SEQ ID NO:290:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 25 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:290:

GACCCTGGGC AACCAGCCCT GTCGT



( 2 ) INFORMATION FOR SEQ ID NO:291:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 47 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:291:

TAATACGACT CACTATAGGG AGGACCCTGG GCAACCAGCC CTGTCGT



( 2 ) INFORMATION FOR SEQ ID NO:292:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 25 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:292:

CTACAATTCT GTTGACTCAG ATTGG 25

( 2 ) INFORMATION FOR SEQ ID NO:293: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 27 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:293:

AAATCCATAC AATACTCCAG TATTTGC 27 

( 2 ) INFORMATION FOR SEQ ID NO:294: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 27 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:294:

GATAACCTTC GGCCTTATCT ATTCCAT 27 

( 2 ) INFORMATION FOR SEQ ID NO:295: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 28 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:295:

ACCCATCCAA AGGAATGGAG GTTCTTTC 28 

( 2 ) INFORMATION FOR SEQ ID NO:296: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 12 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:296:

AGCCTAGCTG AA 12 

( 2 ) INFORMATION FOR SEQ ID NO:297:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:297: 

TCGGATCGAC TT 12 



( 2 ) INFORMATION FOR SEQ ID NO:298:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 22 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:298:

COGAATTAAC CCTCACTAAA GG 22



( 2 ) INFORMATION FOR SEQ ID NO:299: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 22 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:299: 

AATTAACCCT CACTAAAGCC AG 22



( 2 ) INFORMATION FOR SEQ ID NO:300:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 22 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:300:

TAATACGACT CACTATAGGC AG 22



( 2 ) INFORMATION FOR SEQ ID NO:301:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 20 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:301:

ATTTAGGTGA CACTATAGAA 20



( 2 ) INFORMATION FOR SEQ ID NO:302:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:302:



( 2 ) INFORMATION FOR SEQ ID NO:303:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:303:

AGANGATATT



( 2 ) INFORMATION FOR SEQ ID NO:304:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:304:

AAGNTGATAT 



( 2 ) INFORMATION FOR SEQ ID NO:305:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:305:

AAANATGATA 



( 2 ) INFORMATION FOR SEQ ID NO:306: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 10 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:306:

CAANGATGAT



( 2 ) INFORMATION FOR SEQ ID NO:307:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:307:

CCANAGATGA



( 2 ) INFORMATION FOR SEQ ID NO:308:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:308:

ACCNAACATC 10



( 2 ) INFORMATION FOR SEQ ID NO:309: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:309:

CACNAAAGAT 10



( 2 ) INFORMATION FOR SEQ ID NO:310: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 10 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:310:

AGAAACNACA 10



( 2 ) INFORMATION FOR SEQ ID NO:311:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 16 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:311:

ATTTCATTCT GTATTG 16



( 2 ) INFORMATION FOR SEQ ID NO:312:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 16 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:312:

CCGACTGCAG TCGTTA 16



( 2 ) INFORMATION FOR SEQ ID NO:313:

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 15 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:313:

CCGACTCCAG TCGTT 15



( 2 ) INFORMATION FOR SEQ ID NO:314:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:314:

CCCACTACAG TCGTT



( 2 ) INFORMATION FOR SEQ ID NO:315:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 15 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:315:

CCGACTCCAG TCGTT 15

( 2 ) INFORMATION FOR SEQ ID NO:316: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 15 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:316:

CCCACTTCAG TCGTT 15

( 2 ) INFORMATION FOR SEQ ID NO:317:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 35 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I i ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:317:

CTAATTTCTT TTATAGTAGA AACCACAAAC GATAC



( 2 ) INFORMATION FOR SEQ ID NO:318:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 35 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:318:

CATTAAAGAA AATATCATCT TTGGTGTTTC CTATG 



( 2 ) INFORMATION FOR SEQ ID NO:319:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 32 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:319:

CATTAAACAA AATATCATTC GTCTTTCCTA TC 32



( 2 ) INFORMATION FOR SEQ ID NO:320: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 18 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:320: 

CATTAAACAA AATATCAT



( 2 ) INFORMATION FOR SEQ ID NO:321:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 35 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:321:

TATTAAAGAA AATATCATCT TTGGTGTTTC CTATC 35



( 2 ) INFORMATION FOR SEQ ID NO:322:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 35 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:322:

CCTTAAAGAA AATATCATCT TTGGTGTTTC CTAAA 35 



( 2 ) INFORMATION FOR SEQ ID NO:323:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 35 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:323:

CTTTAAAGAA AATAAAAAAA TTGGTGTTTC CTAAA 35



( 2 ) INFORMATION FOR SEQ ID NO:324: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 20 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:324:

GGAACTCTCC CATTTTAATT



( 2 ) INFORMATION FOR SEQ ID NO:325:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 20 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:325:

CCTTCAGAGG GTAAAATTAA



( 2 ) INFORMATION FOR SEQ ID NO:326: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 20 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:326: 

CCTTCAOACK GTAAAATTAA 



( 2 ) INFORMATION FOR SEQ ID NO:327:

( i ) SEQUENCE CHARACTERISTICS:

( A ) LENGTH: 20 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:327:

CCTTCAGAGT GTAAAATTAA 



( 2 ) INFORMATION FOR SEQ ID NO:328: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 19 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:328: 

CCTTCACACO GTAAAATCA 



( 2 ) INFORMATION FOR SEQ ID NO:329:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 19 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( I I ) MOLECULE TYPE: DNA (probe) 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:329:

CCTTCAGAGG GTAAAATTA



( 2 ) INFORMATION FOR SEQ ID NO:330:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 19 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:330:

CATTCAGAGT GTAAAATAC



( 2 ) INFORMATION FOR SEQ ID NO:331:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 19 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:331: 

AAAAAAGAGT GTAAAATGA 



( 2 ) INFORMATION FOR SEQ ID NO:332:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 35 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:332: 

CATTAAAGAA AATAACATCA TTGGTGTTTC CTATG 



( 2 ) INFORMATION FOR SEQ ID NO:333: 

( I ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 648 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:333:



AACAAACCTA CCCACCCTTA ACAGTACATA GTACATAAAG CCATTTACCG TACATAGCAC 60

ATTACAGTCA AATCCCTTCT CGTCCCCATG GATGACCCCC CTCACATAGG GGTCCCTTGA 120

CCACCATCCT CCGTCAAATC AATATCCCGC ACAAGAGTGC TACTCTCCTC GCTCCGGGCC 180

CATAACACTT CGCCGTACCT AAAGTGAACT GTATCCGACA TCTGGTTCCT ACTTCAGGGT 240

CATAAAGCCT AAATAGCCCA CACGTTCCCC TTAAATAAGA CATCACGATG GATCACAGGT 300

CTATCACCCT ATTAACCACT CACGGGAGCT CTCCATGCAT TTGGTATTTT CGTCTGGGCG 360

GTATGCACGC GATAGCATTG CGAGACGCTG GAGCCGGAGC ACCCTATGTC GCAGTATCTG 420

TCTTTGATTC CTGCCTCATC CTATTATTTA TCGCACCTAC GTTCAATATT ACAGGCGAAC 480

ATACTTACTA AAGTGTGTTA ATTAATTAAT GCTTGTAGGA CATAATAATA ACAATTGAAT 540

GTCTCCACAG CCACTTTCCA CACAGACATC ATAACAAAAA ATTTCCACCA AACCCCCCCT 600

CTCCCCCGCT TCTGGCCACA GCACTTAAAC ACATCTCTGC CAAACCCC 648



( 2 ) INFORMATION FOR SEQ ID NO:334: 



( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i I ) MOLECULE TYPE: DNA (probe) 

( x I ) SEQUENCE DESCRIPTION: SEQ ID NO:334: 

CATGCTGAOO AG 



( 2 ) INFORMATION FOR SEQ ID NO:335:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:335:

CTCCTCCCCG GT 



( 2 ) INFORMATION FOR SEQ ID NO:336:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:336:

ACTCCTCCCC GG 



( 2 ) INFORMATION FOR SEQ ID NO:337:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:337:

GACTCCTCCC CG 



( 2 ) INFORMATION FOR SEQ ID NO:338:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:338:

CGACTCCTCC CC 



( 2 ) INFORMATION FOR SEQ ID NO:339:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:339:

ACGACTCCTC CC



( 2 ) INFORMATION FOR SEQ ID NO:340:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:340:

TACGACTCCT CC



( 2 ) INFORMATION FOR SEQ ID NO:341:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:341:

CTACGACTCC TC



( 2 ) INFORMATION FOR SEQ ID NO:342:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:342:

TCTACGACTC CT



( 2 ) INFORMATION FOR SEQ ID NO:343: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:343:

TTCTACGACT CC 



( 2 ) INFORMATION FOR SEQ ID NO:344:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:344:

ATTCTACGAC TC

( 2 ) INFORMATION FOR SEQ ID NO:345: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:345:

TATTCTACGA CT 



( 2 ) INFORMATION FOR SEQ ID NO:346:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:346:

CTATTCTACG AC



( 2 ) INFORMATION FOR SEQ ID NO:347: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 12 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:347:

CCTATTCTAC GA 

( 2 ) INFORMATION FOR SEQ ID NO:348:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:348:

TCCTCCCCGG

( 2 ) INFORMATION FOR SEQ ID NO:349:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:349:

CTCCTCCCCG 



( 2 ) INFORMATION FOR SEQ ID NO:350:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:350:

ACTCCTCCCC 10

( 2 ) INFORMATION FOR SEQ ID NO:351:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:351:

GACTCCTCCC 10



( 2 ) INFORMATION FOR SEQ ID NO:352:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:352:

CGACTCCTCC 10



( 2 ) INFORMATION FOR SEQ ID NO:353:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:353:

ACGACTCCTC 10



( 2 ) INFORMATION FOR SEQ ID NO:354:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:354:

TACGACTCCT 10

( 2 ) INFORMATION FOR SEQ ID NO:355:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:355:

CTACGACTCC 10

( 2 ) INFORMATION FOR SEQ ID NO:356:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:356:

TCTACGACTC 10

( 2 ) INFORMATION FOR SEQ ID NO:357:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:357:

TTCTACGACT 10

( 2 ) INFORMATION FOR SEQ ID NO:358:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:358:

ATTCTACGAC 10



( 2 ) INFORMATION FOR SEQ ID NO:359:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 10 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (probe)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:359:

TATTCTACGA 10

( 2 ) INFORMATION FOR SEQ ID NO:360:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 184 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: DNA (oligonucleotide)

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:360:

TACTCCCCTG CCCTCAACAA GATGTTTTGC CAACTGGCCA AGACCTGCCC TGTGCAGCWG 60

KCCCWWCATT CCACACCCCC GCCCGGCACC CGCGTCCGCG CCATGGCCAT CTACAAGCAG 120

TCACACCACA TCACGGACCW WGKGAGGCGC TGCCCCCACC ATGAGCGCYG CYCAGATAGC 180

SAYC 184




We claim: 

1. An array of oligonucleotide probes immobilized on a 
solid support, said array having at least 100 probes and no 
more than 100,000 different oligonucleotide probes 9 to 20 
nucleotides in length occupying separate known sites in said 
array, said oligonucleotide probes comprising at least four 
sets of probes: (1) a first set that is exactly complementary 
to a reference sequence and comprises probes that com- 
pletely span the reference sequence and, relative to the 
reference sequence, overlap one another in sequence; and (2) 
three additional sets of probes, each of which is identical to 
said first set of probes but for at least one different 
nucleotide, which different nucleotide is located in the same 
position in each of the three additional sets but which is a 
different nucleotide in each set. 

2. The array of claim 1, further comprising a fourth 
additional set of probes, which fourth additional set is 
identical to probes in the first set. 

3. The array of claim 1, wherein said reference sequence 
is a double-stranded nucleic acid and probes complementary 
to both strands of said reference are in said array. 

4. The array of claim 1, wherein said probes are 12 to 17 
nucleotides in length. 

5. The array of claim 4, wherein said probes are 15 
nucleotides in length and attached by a covalent linkage to 
a site on a 3'-end of said probes, and said different nucleotide 
is located at position 7, relative to the 3'-end of said probes. 

6. The array of claim 1, wherein said reference sequence 
is exon 10 of a CFTR gene, and said array has between 1000 
and 100,000 oligonucleotide probes 10 to 18 nucleotides in 
length. 

7. The array of claim 6, wherein said array comprises a set 
of probes comprising a specific nucleotide sequence selected 
from the group of sequences consisting of: 
3'-TTTATAXTAG (SEQ ID. NO:302);
3'-TTATAGXAGA (SEQ ID. NO:303);
3'-TATAGTXGAA (SEQ ID. NO:304);
3'-ATAGTAXAAA (SEQ ID. NO:305);
3'-TAGTAGXAAC (SEQ ID. NO:306);
3'-AGTAGAXACC (SEQ ID. NO:307);
3'-GTAGAAXCCA (SEQ ID. NO:308);
3'-TAGAAAXCAC (SEQ ID. NO:309); and
3'-AGAAACXACA (SEQ ID. NO:310); wherein each set
comprises 4 probes, and X is individually A, G, C, and T
for each set.

8. The array of claim 6, wherein said group of sequences 
consists of: 

3'-TTTATAXTAGAAACC (SEQ ID. NO:9);
3'-TTATAGXAGAAACCA (SEQ ID. NO:10);
3'-TATAGTXGAAACCAC (SEQ ID. NO:11);
3'-ATAGTAXAAACCACA (SEQ ID. NO:12);
3'-TAGTAGXAACCACAA (SEQ ID. NO:13);
3'-AGTAGAXACCACAAA (SEQ ID. NO:14);
3'-GTAGAAXCCACAAAG (SEQ ID. NO:15);
3'-TAGAAAXCACAAAGG (SEQ ID. NO:16); and
3'-AGAAACXACAAAGGA (SEQ ID. NO:17); wherein
each set comprises 4 probes, and X is individually A, G,
C, and T for each set.

9. The array of claim 1, wherein said reference sequence 
is a sequence of a D-loop region of human mitochondrial
DNA.

10. The array of claim 9, wherein said probes are 15 
nucleotides in length, and said array comprises a first set of 
probes exactly complementary to a sequence contained in a 
sequence bounded by positions 16280 to 356 of the refer- 
ence sequence and four additional sets of probes identical to 
said first set but for position 7, relative to a 3'-end of a probe, 
which 3'-end is covalently attached to the substrate, where,
for each of the four additional probe sets, a different nucle-
otide is located, such that, for each probe in said first set,
there is an identical probe in one of the four additional sets, 
and such that the array has between 2500 and 100,000 
oligonucleotide probes. 

11. The array of claim 1, wherein said reference sequence 
is a sequence from an exon of a human p53 gene.

12. The array of claim 11, wherein said reference 
sequence comprises at least a 60 nucleotide contiguous 
sequence from exon 6 of a p53 gene. 

13. The array of claim 11, wherein said reference 
sequence is exon 5 of a p53 gene, said probes are 17
nucleotides long, and said array comprises a first set of
probes exactly complementary to said sequence and at least
three additional sets of probes, each set comprising probes
identical to said first set but for a nucleotide at position 7,
relative to a 3'-end of a probe, which 3'-end is covalently
attached to the substrate, which nucleotide is different from 
a nucleotide at this position in a corresponding probe of said 
first set. 

14. The array of claim 1, wherein said probes are oli-
godeoxyribonucleotides.

15. The array of claim 1, wherein said array has between 
10,000 and 100,000 probes. 

16. The array of claim 1, wherein the reference sequence 
is from a human immunodeficiency virus. 

17. The array of claim 16, wherein the reference sequence
is from a reverse transcriptase gene of the human immuno- 
deficiency virus. 

18. The array of claim 1, wherein said probes are immo- 
bilized to said solid support via a linker. 
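
For illustration only, the probe-set structure recited in claim 1 can be sketched in Python. The sketch is not part of the claims. The function name tiling_probe_sets and the 18-nucleotide reference fragment are assumptions; the fragment was chosen so that its complements reproduce the probe sequences listed in claim 7, with X at position 7 from the 3'-end. Probes are written 3' to 5', so each probe is the base-by-base complement of a reference window written 5' to 3'.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def tiling_probe_sets(reference, probe_len, sub_pos):
    """Build a perfect-match probe set and three substituted sets.

    reference: target sequence written 5'->3'.
    probe_len: probe length in nucleotides.
    sub_pos:   1-based substitution position counted from the probe 3'-end.

    Returns (first_set, additional_sets); probes are written 3'->5'.
    """
    if not 1 <= sub_pos <= probe_len <= len(reference):
        raise ValueError("need 1 <= sub_pos <= probe_len <= len(reference)")
    first_set = []
    additional_sets = [[], [], []]
    i = sub_pos - 1
    # One probe per window, each window shifted by one nucleotide, so the
    # probes overlap one another and together span the reference.
    for start in range(len(reference) - probe_len + 1):
        window = reference[start:start + probe_len]
        probe = "".join(COMPLEMENT[b] for b in window)
        first_set.append(probe)
        # The three bases that differ from the perfect-match base.
        others = [b for b in BASES if b != probe[i]]
        for k, base in enumerate(others):
            additional_sets[k].append(probe[:i] + base + probe[i + 1:])
    return first_set, additional_sets


# Hypothetical reference fragment (an assumption; see the note above).
first, extra = tiling_probe_sets("AAATATCATCTTTGGTGT", probe_len=10, sub_pos=7)
print(first[0])               # perfect-match probe for the first window
print([s[0] for s in extra])  # its three single-base variants
```

Grouping the k-th alternative base of every window into one list is only one way to organize the three additional sets; any assignment in which each set carries a different nucleotide at the substituted position would fit the same description.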
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ARRAYS OF MODIFIED NUCLEIC ACID 
PROBES AND METHODS OF USE 

CROSS-REFERENCE TO RELATED
APPLICATION

This application is a continuation-in-part of U.S. Ser. No. 
08/440,742 filed May 10, 1995 abandoned, which is a 
continuation-in-part of PCT application (designating the 
United States) SN PCT/US94/12305 filed Oct. 26, 1994, 
which is a continuation-in-part of U.S. Ser. No. 08/284,064 
filed Aug. 2, 1994 abandoned, which is a continuation-in-
part of U.S. Ser. No. 08/143,312 filed Oct. 26, 1993
abandoned, each of which is incorporated herein by refer- 
ence in its entirety for all purposes. 

FIELD OF THE INVENTION 

The present invention provides probes comprised of 
nucleotide analogues immobilized in arrays on solid sub- 
strates for analyzing molecular interactions of biological 
interest, and target nucleic acids comprised of nucleotide 
analogues. The invention therefore relates to the molecular 
interaction of polymers immobilized on solid substrates 
including related chemistry, biology, and medical diagnostic 
uses. 

BACKGROUND OF THE INVENTION 

The development of very large scale immobilized poly- 
mer synthesis (VLSIPS™) technology provides pioneering 
methods for arranging large numbers of oligonucleotide 
probes in very small arrays. See, U.S. application Ser. No. 
07/805,727 now U.S. Pat. No. 5,424,186 and PCT patent 
publication Nos. WO 90/15070 and 92/10092, each of which 
is incorporated herein by reference for all purposes. U.S. 
patent application Ser. No. 08/082,937, filed Jun. 25, 1993, 
and incorporated herein for all purposes, describes methods 
for making arrays of oligonucleotide probes that are used, 
e.g., to determine the complete sequence of a target nucleic 
acid and/or to detect the presence of a nucleic acid with a 
specified sequence. 

VLSIPS™ technology provides an efficient means for 
large scale production of miniaturized oligonucleotide 
arrays for sequencing by hybridization (SBH), diagnostic 
testing for inherited or somatically acquired genetic 
diseases, and forensic analysis. Other applications include 
determination of sequence specificity of nucleic acids, 
protein-nucleic acid complexes and other polymer-polymer 
interactions. 

SUMMARY OF THE INVENTION 

The present invention provides arrays of oligonucleotide 
analogues attached to solid substrates. Oligonucleotide ana- 
logues have different hybridization properties than oligo- 
nucleotides based upon naturally occurring nucleotides. By 
incorporating oligonucleotide analogues into the arrays of 
the invention, hybridization to a target nucleic acid is 
optimized. 

The oligonucleotide analogue arrays have virtually any 
number of different members, determined largely by the 
number or variety of compounds to be screened against the 
array in a given application. In one group of embodiments, 
the array has from 10 up to 100 oligonucleotide analogue 
members. In other groups of embodiments, the arrays have 
between 100 and 10,000 members, and in yet other embodi-
ments the arrays have between 10,000 and 1,000,000
members. In preferred embodiments, the array will have a
density of more than 100 members at known locations per
cm², or more preferably, more than 1000 members per cm².
In some embodiments, the arrays have a density of more
than 10,000 members per cm².

The solid substrate upon which the array is constructed
includes any material upon which oligonucleotide analogues
are attached in a defined relationship to one another, such as
beads, arrays, and slides. Especially preferred oligonucle-
otide analogues of the array are between about 5 and about
20 nucleotides, nucleotide analogues or a mixture thereof in
length.

In one group of embodiments, nucleoside analogues 
incorporated into the oligonucleotide analogues of the array 
will have the chemical formula:

[chemical structure not reproduced]

wherein R¹ and R² are independently selected from the
group consisting of hydrogen, methyl, hydroxy, alkoxy (e.g.,
methoxy, ethoxy, propoxy, allyloxy, and propargyloxy),
alkylthio, halogen (fluorine, chlorine, and bromine),
cyano, and azido, and wherein Y is a heterocyclic moiety,
e.g., a base selected from the group consisting of purines,
purine analogues, pyrimidines, pyrimidine analogues, uni-
versal bases (e.g., 5-nitroindole) or other groups or ring
systems capable of forming one or more hydrogen bonds
with corresponding moieties on alternate strands within a
double- or triple-stranded nucleic acid or nucleic acid
analogue, or other groups or ring systems capable of forming
nearest-neighbor base-stacking interactions within a double-
or triple-stranded complex. In other embodiments, the oli-
gonucleotide analogues are not constructed from
nucleosides, but are capable of binding to nucleic acids in
solution due to structural similarities between the oligo-
nucleotide analogue and a naturally occurring nucleic acid.
An example of such an oligonucleotide analogue is a peptide
nucleic acid or polyamide nucleic acid in which bases which
hydrogen bond to a nucleic acid are attached to a polyamide
backbone.

The present invention also provides target nucleic acids
hybridized to oligonucleotide arrays. In the target nucleic
acids of the invention, nucleotide analogues are incorporated
into the target nucleic acid, altering the hybridization prop-
erties of the target nucleic acid to an array of oligonucleotide
probes. Typically, the oligonucleotide probe arrays also
comprise nucleotide analogues.

The target nucleic acids are typically synthesized by
providing a nucleotide analogue as a reagent during the
enzymatic copying of a nucleic acid. For instance, nucle-
otide analogues are incorporated into polynucleic acid ana-
logues using Taq polymerase in a PCR reaction. Thus, a
nucleic acid containing a sequence to be analyzed is typi-
cally amplified in a PCR or RNA amplification procedure
with nucleotide analogues, and the resulting target nucleic
acid analogue amplicon is hybridized to a nucleic acid
analogue array.

Oligonucleotide analogue arrays and target nucleic acids
are optionally composed of oligonucleotide analogues
which are resistant to hydrolysis or degradation by nuclease
enzymes such as RNAase A. This has the advantage of
providing the array or target nucleic acid with greater
longevity by rendering it resistant to enzymatic degradation.
For example, analogues comprising 2'-O-
methyloligoribonucleotides are resistant to RNAase A.

Oligonucleotide analogue arrays are optionally arranged
into libraries for screening compounds for desired
characteristics, such as the ability to bind a specified oligo-
nucleotide analogue, or oligonucleotide analogue-
containing structure. The libraries also include oligonucle-
otide analogue members which form conformationally-
restricted probes, such as unimolecular double-stranded
probes or unimolecular double-stranded probes which
present a third chemical structure of interest. For instance,
the array of oligonucleotide analogues optionally includes a
plurality of different members, each member having the
formula: Y-L¹-X¹-L²-X², wherein Y is a solid
substrate, X¹ and X² are complementary oligonucleotides
containing at least one nucleotide analogue, L¹ is a spacer,
and L² is a linking group having sufficient length such that
X¹ and X² form a double-stranded oligonucleotide. An array
of such members comprises a library of unimolecular double-
stranded oligonucleotide analogues. In another embodiment,
the members of the array of oligonucleotide are arranged to
present a moiety of interest within the oligonucleotide
analogue probes of the array. For instance, the arrays are
optionally conformationally restricted, having the formula
-X¹¹-Z-X¹², wherein X¹¹ and X¹² are complementary
oligonucleotides or oligonucleotide analogues and Z is a
chemical structure comprising the binding site of interest.

Oligonucleotide analogue arrays are synthesized on a
solid substrate by a variety of methods, including light-
directed chemical coupling, and selectively flowing syn-
thetic reagents over portions of the solid substrate. The solid
substrate is prepared for synthesis or attachment of oligo-
nucleotides by treatment with suitable reagents. For
example, glass is prepared by treatment with silane reagents.

The present invention provides methods for determining
whether a molecule of interest binds members of the oligo-
nucleotide analogue array. For instance, in one embodiment,
a target molecule is hybridized to the array and the resulting
hybridization pattern is determined. The target molecule
includes genomic DNA, cDNA, unspliced RNA, mRNA,
and rRNA, nucleic acid analogues, proteins and chemical
polymers. The target molecules are optionally amplified
prior to being hybridized to the array, e.g., by PCR, LCR, or
cloning methods.

The oligonucleotide analogue members of the array used
in the above methods are synthesized by any described
method for creating arrays. In one embodiment, the oligo-
nucleotide analogue members are attached to the solid
substrate, or synthesized on the solid substrate by light-
directed very large scale immobilized polymer synthesis,
e.g., using photo-removable protecting groups during syn-
thesis. In another embodiment, the oligonucleotide members
are attached to the solid substrate by forming a plurality of
channels adjacent to the surface of said substrate, placing
selected monomers in said channels to synthesize oligo-
nucleotide analogues at predetermined portions of selected
regions, wherein the portion of the selected regions com-
prise oligonucleotide analogues different from oligonucle-
otide analogues in at least one other of the selected regions,
and repeating the steps with the channels formed along a
second portion of the selected regions. The solid substrate is
any suitable material as described above, including beads,
slides, and arrays, each of which is constructed from, e.g.,
silica, polymers and glass.

DEFINITIONS

An "Oligonucleotide" is a nucleic acid sequence com-
posed of two or more nucleotides. An oligonucleotide is
optionally derived from natural sources, but is often syn-
thesized chemically. It is of any size. An "oligonucleotide
analogue" refers to a polymer with two or more monomeric
subunits, wherein the subunits have some structural features
in common with a naturally occurring oligonucleotide which
allow it to hybridize with a naturally occurring oligonucle-
otide in solution. For instance, structural groups are option-
ally added to the ribose or base of a nucleoside for incor-
poration into an oligonucleotide, such as a methyl or allyl
group at the 2'-O position on the ribose, or a fluoro group
which substitutes for the 2'-O group, or a bromo group on
the ribonucleoside base. The phosphodiester linkage, or
"sugar-phosphate backbone" of the oligonucleotide ana- 
logue is substituted or modified, for instance with methyl 
phosphonates or O-methyl phosphates. Another example of 
an oligonucleotide analogue for purposes of this disclosure 
includes "peptide nucleic acids" in which native or modified 
nucleic acid bases are attached to a polyamide backbone. 
Oligonucleotide analogues optionally comprise a mixture of 
naturally occurring nucleotides and nucleotide analogues. 
However, an oligonucleotide which is made entirely of 
naturally occurring nucleotides (i.e., those comprising DNA 
or RNA), with the exception of a protecting group on the end 
of the oligonucleotide, such as a protecting group used 
during standard nucleic acid synthesis is not considered an 
oligonucleotide analogue for purposes of this invention. 

A "nucleoside" is a pentose glycoside in which the 
aglycone is a heterocyclic base; upon the addition of a 
phosphate group the compound becomes a nucleotide. The 
major biological nucleosides are β-glycoside derivatives of
D-ribose or D-2-deoxyribose. Nucleotides are phosphate 
esters of nucleosides which are generally acidic in solution 
due to the hydroxy groups on the phosphate. The nucleo- 
sides of DNA and RNA are connected together via phos- 
phate units attached to the 3' position of one pentose and the 
5' position of the next pentose. Nucleotide analogues and/or 
nucleoside analogues are molecules with structural similari- 
ties to the naturally occurring nucleotides or nucleosides as 
discussed above in the context of oligonucleotide analogues. 

A "nucleic acid reagent" utilized in standard automated 
oligonucleotide synthesis typically carries a protected phos-
phate on the 3' hydroxyl of the ribose. Thus, nucleic acid
reagents are referred to as nucleotides, nucleotide reagents,
nucleoside reagents, nucleoside phosphates, nucleoside-3'-
phosphates, nucleoside phosphoramidites,
phosphoramidites, nucleoside phosphonates, phosphonates 
and the like. It is generally understood that nucleotide 
reagents carry a reactive, or activatible, phosphoryl or 
phosphonyl moiety in order to form a phosphodiester link- 
age. 

A "protecting group" as used herein, refers to any of the 
groups which are designed to block one reactive site in a 
molecule while a chemical reaction is carried out at another 
reactive site. More particularly, the protecting groups used 
herein are optionally any of those groups described in 
Greene, et al., Protective Groups In Organic Chemistry, 2nd
Ed., John Wiley & Sons, New York, NY, 1991, which is 
incorporated herein by reference. The proper selection of 
protecting groups for a particular synthesis is governed by 
the overall methods employed in the synthesis. For example, 
in "light-directed" synthesis, discussed herein, the protect- 
ing groups are photolabile protecting groups such as NVOC, 
MeNPoc, and those disclosed in co-pending Application 
PCT/US93/10162 (filed Oct. 22, 1993), incorporated herein 
by reference. In other methods, protecting groups are 
removed by chemical methods and include groups such as 
FMOC, DMT and others known to those of skill in the art. 
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A "purine" is a generic term based upon the specific 
compound "purine" having a skeletal structure derived from 
the fusion of a pyrimidine ring and an imidazole ring. It is 
generally, and herein, used to describe a generic class of 
compounds which have an atom or a group of atoms added
to the parent purine compound, such as the bases found in
the naturally occurring nucleic acids adenine
(6-aminopurine) and guanine (2-amino-6-oxopurine), or less
commonly occurring molecules such as 2-amino-adenine,
N⁶-methyladenine, or 2-methylguanine.

A "purine analogue" has a heterocyclic ring with struc- 
tural similarities to a purine, in which an atom or group of 
atoms is substituted for an atom in the purine ring. For 
instance, in one embodiment, one or more N atoms of the 
purine heterocyclic ring are replaced by C atoms.

A "pyrimidine" is a compound with a specific heterocy- 
clic diazine ring structure, but is used generically by persons 
of skill and herein to refer to any compound having a 
1,3-diazine ring with minor additions, such as the common
nucleic acid bases cytosine, thymine, uracil,
5-methylcytosine and 5-hydroxymethylcytosine, or the non- 
naturally occurring 5-bromo-uracil. 

A "pyrimidine analogue" is a compound with structural 
similarity to a pyrimidine, in which one or more atoms in the
pyrimidine ring is substituted. For instance, in one 
embodiment, one or more of the N atoms of the ring are 
substituted with C atoms. 

A "solid substrate" has a fixed organizational support
matrix, such as silica, polymeric materials, or glass. In some
embodiments, at least one surface of the substrate is partially
planar. In other embodiments it is desirable to physically
separate regions of the substrate to delineate synthetic
regions, for example with trenches, grooves, wells or the
like. Examples of solid substrates include slides, beads and
arrays.

DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows four panels (FIG. 1A, FIG. 1B, FIG. 1C and
FIG. 1D). FIGS. 1A and 1B graphically display the differ-
ence in fluorescence intensity between the matched and
mismatched DNA probes. FIGS. 1C and 1D illustrate the
difference in fluorescence intensity versus location on an
example chip for DNA and RNA targets, respectively.

FIG. 2 is a graphic illustration of specific light-directed
chemical coupling of oligonucleotide analogue monomers to 
an array. 

FIG. 3 shows the relative efficiency and specificity of
hybridization for immobilized probe arrays containing
adenine versus probe arrays containing 2,6-diaminopurine
nucleotides. (3'-CATCGTAGAA-5' (SEQ ID NO:1)).

FIG. 4 shows the effect of substituting adenine with 
2,6-diaminopurine (D) in immobilized poly-dA probe 
arrays. (AAAAANAAAAA (SEQ ID NO:2)).

FIG. 5 shows the effects of substituting 5-propynyl-2'-
deoxyuridine and 2-amino-2'-deoxyadenosine in AT arrays
on hybridization to a target nucleic acid. (ATATAATATA
(SEQ ID NO:3) and CGCGCCGCGC (SEQ ID NO:4)). 

FIG. 6 shows the effects of dI and 7-deaza-dG substitu-
tions in oligonucleotide arrays. (3'-ATGTT(G1G2G3G4G5)
CGGGT-5' (SEQ ID NO:5)).

DETAILED DESCRIPTION 

Methods of synthesizing desired single stranded oligo-
nucleotide and oligonucleotide analogue sequences are
known to those of skill in the art. In particular, methods of
synthesizing oligonucleotides and oligonucleotide ana- 
logues are found in, for example, Oligonucleotide Synthesis: 
A Practical Approach, Gait, ed., IRL Press, Oxford (1984); 
W. H. A. Kuijpers Nucleic Acids Research 18(17), 5197 
(1994); K. L. Dueholm J. Org. Chem. 59, 5767-5773 
(1994), and S. Agrawal (ed.) Methods in Molecular Biology, 
volume 20, each of which is incorporated herein by refer- 
ence in its entirety for all purposes. Synthesizing unimo- 
lecular double-stranded DNA in solution has also been 
described. See, copending application Ser. No. 08/327,687, 
now U.S. Pat. No. 5,556,752 which is incorporated herein 
for all purposes. 

Improved methods of forming large arrays of 
oligonucleotides, peptides and other polymer sequences 
with a minimal number of synthetic steps are known. See, 
Pirrung et al., U.S. Pat. No. 5,143,854 (see also, PCT 
Application No. WO 90/15070) and Fodor et al., PCT 
Publication No. WO 92/10092, which are incorporated 
herein by reference, which disclose methods of forming vast 
arrays of peptides, oligonucleotides and other molecules 
using, for example, light-directed synthesis techniques. See 
also, Fodor et al., (1991) Science, 251, 767-77 which is
incorporated herein by reference for all purposes. These 
procedures for synthesis of polymer arrays are now referred 
to as VLSIPS™ procedures. 

Using the VLSIPS™ approach, one heterogenous array of
polymers is converted, through simultaneous coupling at a 
number of reaction sites, into a different heterogenous array. 
See, U.S. application Ser. No. 07/796,243 now U.S. Pat. No. 
5,384,261 and U.S. application Ser. No. 07/980,523 now 
U.S. Pat. No. 5,677,195, the disclosures of which are incor- 
porated herein for all purposes. 

The development of VLSIPS™ technology as described 
in the above-noted U.S. Pat. No. 5,143,854 and PCT patent 
publication Nos. WO 90/15070 and 92/10092 is considered 
pioneering technology in the fields of combinatorial synthe- 
sis and screening of combinatorial libraries. More recently, 
patent application Ser. No. 08/082,937, filed Jun. 25, 1993 
(incorporated herein by reference), describes methods for 
making arrays of oligonucleotide probes that are used to 
check or determine a partial or complete sequence of a target 
nucleic acid and to detect the presence of a nucleic acid 
containing a specific oligonucleotide sequence. 

Combinatorial Synthesis of Oligonucleotide Arrays 

VLSIPS™ technology provides for the combinatorial 
synthesis of oligonucleotide arrays. The combinatorial 
VLSIPS™ strategy allows for the synthesis of arrays con- 
taining a large number of related probes using a minimal 
number of synthetic steps. For instance, it is possible to 
synthesize and attach all possible DNA 8mer oligonucle-
otides (4⁸, or 65,536 possible combinations) using only 32
chemical synthetic steps. In general, VLSIPS™ procedures
provide a method of producing 4ⁿ different oligonucleotide
probes on an array using only 4n synthetic steps.
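
The step count above can be checked with a small simulation. The following Python sketch is illustrative only and is not taken from the cited references: the function name synthesize_all_nmers and the assignment of one lexicographically ordered n-mer to each site are assumptions. Each masking and coupling step adds one base to every site whose target sequence calls for that base at the current position.

```python
from itertools import product

BASES = "ACGT"


def synthesize_all_nmers(n):
    """Simulate combinatorial synthesis of every n-mer on 4**n sites.

    Site k is assigned the k-th n-mer in lexicographic order (an
    assumption of this sketch). Returns (probes, steps), where steps
    is the number of coupling steps performed.
    """
    targets = ["".join(t) for t in product(BASES, repeat=n)]
    probes = [""] * len(targets)
    steps = 0
    for position in range(n):
        for base in BASES:
            # One coupling step: only the sites "exposed" for this base
            # at this position are extended.
            for site, target in enumerate(targets):
                if target[position] == base:
                    probes[site] += base
            steps += 1
    return probes, steps


probes, steps = synthesize_all_nmers(8)
print(len(set(probes)), steps)
```

For n = 8 the simulation yields 4⁸ = 65,536 distinct probes after 4 × 8 = 32 coupling steps, matching the figures given in the text.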

In brief, the light-directed combinatorial synthesis of 
oligonucleotide arrays on a glass surface proceeds using 
automated phosphoramidite chemistry and chip masking 
techniques. In one specific implementation, a glass surface 
is derivatized with a silane reagent containing a functional 
group, e.g., a hydroxyl or amine group blocked by a pho-
tolabile protecting group. Photolysis through a photolitho-
graphic mask is used selectively to expose functional groups
which are then ready to react with incoming 
5'-photoprotected nucleoside phosphoramidites. See, FIG. 2. 
The phosphoramidites react only with those sites which are 
illuminated (and thus exposed by removal of the photolabile 
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blocking group). Thus, the phosphoramidites only add to 
those areas selectively exposed from the preceding step. 
These steps are repeated until the desired array of sequences 
have been synthesized on the solid surface. Combinatorial 
synthesis of different oligonucleotide analogues at different 5 
locations on the array is determined by the pattern of 
illumination during synthesis and the order of addition of 
coupling reagents. 
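
The masking logic described above can be sketched as a small simulation. This is a simplified illustration and not the disclosed process: the function name, the one-cycle-per-base-per-position schedule, and the assumption of perfect coupling are all introduced here, and chemical details such as synthesis direction are ignored.

```python
from itertools import product

BASES = "ACGT"


def synthesize_array(n):
    """Simulate masked coupling cycles that build every n-mer on an array.

    Each site is assigned a target sequence. In the cycle for a given
    (position, base), the mask illuminates only the sites whose target has
    that base at that position, so only those sites are extended.
    """
    targets = ["".join(p) for p in product(BASES, repeat=n)]
    grown = [""] * len(targets)  # sequence built so far at each site
    cycles = 0
    for position in range(n):
        for base in BASES:
            # Mask entry is True where the site is illuminated (deprotected).
            mask = [t[position] == base for t in targets]
            grown = [g + base if lit else g for g, lit in zip(grown, mask)]
            cycles += 1
    return grown, targets, cycles


grown, targets, cycles = synthesize_array(3)
print(grown == targets, len(set(grown)), cycles)  # True 64 12
```

Under these assumptions, the illumination pattern for each cycle fully determines which sequence ends up at which location, and the cycle count grows as 4n while the number of distinct probes grows as 4ⁿ.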

In the event that an oligonucleotide analogue with a
polyamide backbone is used in the VLSIPS™ procedure, it
is generally inappropriate to use phosphoramidite chemistry
to perform the synthetic steps, since the monomers do not
attach to one another via a phosphate linkage. Instead,
peptide synthetic methods are substituted. See, e.g., Pirrung
et al., U.S. Pat. No. 5,143,854.

Peptide nucleic acids, which comprise a polyamide backbone and
the bases found in naturally occurring nucleosides, are
commercially available from, e.g., Biosearch, Inc. (Bedford,
Mass.). Peptide nucleic acids are capable of binding to nucleic
acids with high specificity, and are considered
"oligonucleotide analogues" for purposes of this disclosure.
Note that peptide nucleic acids optionally comprise bases 
other than those which are naturally occurring. 

Hybridization of Nucleotide Analogues

The stability of duplexes formed between RNAs or DNAs
is generally in the order of
RNA:RNA>RNA:DNA>DNA:DNA, in solution. Long
probes have better duplex stability with a target, but poorer
mismatch discrimination, than shorter probes (mismatch
discrimination refers to the measured hybridization signal
ratio between a perfect match probe and a single base
mismatch probe). Shorter probes (e.g., 8-mers) discriminate
mismatches very well, but the overall duplex stability is low.
In order to optimize mismatch discrimination and duplex
stability, the present invention provides a variety of
nucleotide analogues incorporated into polymers and attached in
an array to a solid substrate.

Altering the thermal stability (Tm) of the duplex formed
between the target and the probe using, e.g., known
oligonucleotide analogues allows for optimization of duplex
stability and mismatch discrimination. One useful aspect of
altering the Tm arises from the fact that Adenine-Thymine
(A-T) duplexes have a lower Tm than Guanine-Cytosine
(G-C) duplexes, due in part to the fact that the A-T duplexes
have 2 hydrogen bonds per base pair, while the G-C
duplexes have 3 hydrogen bonds per base pair. In
heterogeneous oligonucleotide arrays in which there is a
non-uniform distribution of bases, it can be difficult to optimize
hybridization conditions for all probes simultaneously. Thus,
in some embodiments, it is desirable to destabilize G-C-rich
duplexes and/or to increase the stability of A-T-rich duplexes
while maintaining the sequence specificity of hybridization.
This is accomplished, e.g., by replacing one or more of the
native nucleotides in the probe (or the target) with certain
modified, non-standard nucleotides. Substitution of guanine
residues with 7-deazaguanine, for example, will generally
destabilize duplexes, whereas substituting adenine residues
with 2,6-diaminopurine will enhance duplex stability. A
variety of other modified bases are also incorporated into
nucleic acids to enhance or decrease overall duplex stability
while maintaining specificity of hybridization. The
incorporation of 6-aza-pyrimidine analogs into oligonucleotide
probes generally decreases their binding affinity for
complementary nucleic acids. Many 5-substituted pyrimidines
substantially increase the stability of hybrids in which they have
been substituted in place of the native pyrimidines in the
sequence. Examples include 5-bromo-, 5-methyl-,
5-propynyl-, 5-(imidazol-2-yl)- and 5-(thiazol-2-yl)-
derivatives of cytosine and uracil.
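
The base-composition effect described above can be illustrated with a rough sketch. The hydrogen-bond count follows the text (2 per A-T pair, 3 per G-C pair). The melting-temperature estimate uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is an assumption introduced here for illustration and not a method of this disclosure; it applies only to short, unmodified DNA oligonucleotides and says nothing about the analogues discussed above.

```python
def _count_at_gc(seq):
    """Return (number of A/T bases, number of G/C bases) in a DNA string."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return at, gc


def hydrogen_bonds(seq):
    """Watson-Crick hydrogen bonds in the duplex of seq with its complement."""
    at, gc = _count_at_gc(seq)
    return 2 * at + 3 * gc


def wallace_tm(seq):
    """Rough Tm estimate in degrees C (Wallace rule; illustrative only)."""
    at, gc = _count_at_gc(seq)
    return 2 * at + 4 * gc


for probe in ("AATTAATT", "GGCCGGCC"):
    print(probe, hydrogen_bonds(probe), wallace_tm(probe))
# AATTAATT 16 16
# GGCCGGCC 24 32
```

Two 8-mers of equal length differ markedly in estimated stability, which is why a single set of hybridization conditions is hard to optimize for an array with a non-uniform base distribution.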

Many modified nucleosides, nucleotides and various
bases suitable for incorporation into nucleosides are
commercially available from a variety of manufacturers, including
the SIGMA Chemical Company (Saint Louis, Mo.), R&D
Systems (Minneapolis, Minn.), Pharmacia LKB Biotechnology
(Piscataway, N.J.), CLONTECH Laboratories, Inc.
(Palo Alto, Calif.), Chem Genes Corp., Aldrich Chemical
Company (Milwaukee, Wis.), Glen Research, Inc., GIBCO
BRL Life Technologies, Inc. (Gaithersburg, Md.), Fluka
Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs,
Switzerland), Invitrogen (San Diego, Calif.), and Applied
Biosystems (Foster City, Calif.), as well as many other
commercial sources known to one of skill. Methods of
attaching bases to sugar moieties to form nucleosides are
known. See, e.g., Lukevics and Zablocka (1991) Nucleoside
Synthesis: Organosilicon Methods, Ellis Horwood Limited,
Chichester, West Sussex, England, and the references therein.
Methods of phosphorylating nucleosides to form
nucleotides, and of incorporating nucleotides into
oligonucleotides, are also known. See, e.g., Agrawal (ed) (1993)
Protocols for Oligonucleotides and Analogues, Synthesis
and Properties, Methods in Molecular Biology volume 20,
Humana Press, Totowa, N.J., and the references therein. See
also, Crooke and Lebleu, and Sanghvi and Cook, and the
references cited therein, both supra.

Groups are also linked to various positions on the nucleo- 
side sugar ring or on the purine or pyrimidine rings which 
may stabilize the duplex by electrostatic interactions with 
the negatively charged phosphate backbone, or through 
hydrogen bonding interactions in the major and minor
grooves. For example, adenosine and guanosine nucleotides
are optionally substituted at the N2 position with an
imidazolyl propyl group, increasing duplex stability. Universal
base analogues such as 3-nitropyrrole and 5-nitroindole are 
optionally included in oligonucleotide probes to improve 
duplex stability through base stacking interactions. 

Selecting the length of oligonucleotide probes is also an 
important consideration when optimizing hybridization 
specificity. In general, shorter probe sequences are more 
specific than longer ones, in that the occurrence of a single- 
base mismatch has a greater destabilizing effect on the 
hybrid duplex. However, because the overall thermodynamic
stability of hybrids decreases with decreasing probe length, in
some embodiments it is desirable to enhance duplex stability
for short probes globally. Certain modifications of the sugar
moiety in oligonucleotides provide useful stabilization, and
these can be used to increase the affinity of probes for
complementary nucleic acid sequences. For example,
2'-O-methyl-, 2'-O-propyl-, and 2'-O-allyl-oligoribonucleotides
have higher binding affinities for complementary RNA
sequences than their unmodified counterparts. Probes comprised
of 2'-fluoro-2'-deoxyoligoribonucleotides also form more
stable hybrids with RNA than do their unmodified
counterparts.

Replacement or substitution of the internucleotide phos- 
phodiester linkage in oligo- or poly-nucleotides is also used 
to either increase or decrease the affinity of probe-target 
interactions. For example, substituting phosphodiester link- 
ages with phosphorothioate or phosphorodithioate linkages 
generally lowers duplex stability, without affecting sequence 
specificity. Substitutions with a non-ionic methylphospho- 
nate linkage (racemic, or preferably, Rp stereochemistry) 
have a stabilizing influence on hybrid formation. Neutral or
cationic phosphoramidate linkages also result in enhanced
duplex stabilization. The phosphate diester backbone has
been replaced with a variety of other stabilizing, non-natural
linkages which have been studied as potential antisense
therapeutic agents. See, e.g., Crooke and Lebleu (eds)
(1993) Antisense Research and Applications, CRC Press; and
Sanghvi and Cook (eds) (1994) Carbohydrate Modifications
in Antisense Research, ACS Symp. Ser. #580, ACS,
Washington, D.C. Very stable hybrids are formed between nucleic
acids and probes comprised of peptide nucleic acids, in
which the entire sugar-phosphate backbone has been
replaced with a polyamide structure.

Another important factor which sometimes affects the use 
of oligonucleotide probe arrays is the nature of the target 
nucleic acid. Oligodeoxynucleotide probes can hybridize to
DNA and RNA targets with different affinity and specificity.
For example, probe sequences containing long "runs" of
consecutive deoxyadenosine residues form less stable
hybrids with complementary RNA sequences than with the
complementary DNA sequences. Substitution of dA in the
probe with either 2,6-diaminopurine deoxyriboside, or
2'-alkoxy- or 2'-fluoro-dA enhances hybridization with RNA
targets.

Internal structure within nucleic acid probes or the targets 
also influences hybridization efficiency. For example,
GC-rich sequences, and sequences containing "runs" of
consecutive G residues, frequently self-associate to form
higher-order structures, and this can inhibit their binding to
complementary sequences. See, Zimmermann et al. (1975)
J. Mol. Biol. 92: 181; Kim (1991) Nature 351: 331; Sen and
Gilbert (1988) Nature 335: 364; and Sundquist and Klug
(1989) Nature 342: 825. These structures are selectively
destabilized by the substitution of one or more guanine
residues with one or more of the following purines or purine
analogs: 7-deazaguanine, 8-aza-7-deazaguanine,
2-aminopurine, 1H-purine, and hypoxanthine, in order to
enhance hybridization.

Modified nucleic acids and nucleic acid analogs can also 
be used to improve the chemical stability of probe arrays. 
For example, certain processes and conditions that are useful
for either the fabrication or subsequent use of the arrays
may not be compatible with standard oligonucleotide
chemistry, and alternate chemistry can be employed to
overcome these problems. For instance, exposure to acidic
conditions will cause depurination of purine nucleotides,
ultimately resulting in chain cleavage and overall
degradation of the probe array. In this case, adenine and guanine are
replaced with 7-deazaadenine and 7-deazaguanine,
respectively, in order to stabilize the oligonucleotide probes
towards acidic conditions which are used during the
manufacture or use of the arrays.

Base, phosphate and sugar modifications are used in 
combination to make highly modified oligonucleotide ana- 
logues which take advantage of the properties of each of the 
various modifications. For example, oligonucleotides which
have higher binding affinities for complementary sequences
than their unmodified counterparts (e.g., 2'-O-methyl-,
2'-O-propyl-, and 2'-O-allyl oligonucleotides) can be
incorporated into oligonucleotides with modified bases
(deazaguanine, 8-aza-7-deazaguanine, 2-aminopurine,
1H-purine, hypoxanthine and the like) with non-ionic
methylphosphonate linkages or neutral or cationic
phosphoramidate linkages, resulting in additive stabilization of
duplex formation between the oligonucleotide and a target
nucleic acid. For instance, one preferred oligonucleotide comprises
a 2'-O-methyl-2,6-diaminopurineriboside phosphorothioate.
Similarly, any of the modified bases described herein can be
incorporated into peptide nucleic acids, in which the entire
sugar-phosphate backbone has been replaced with a
polyamide structure.

Thermal equilibrium studies, kinetic "on-rate" studies,
and sequence specificity analyses are optionally performed for
any target oligonucleotide and probe or probe analogue. The
data obtained show the behavior of the analogues upon
duplex formation with target oligonucleotides. Altered
duplex stability conferred by using oligonucleotide analogue
probes is ascertained by following, e.g., fluorescence signal
intensity of oligonucleotide analogue arrays hybridized with
a target oligonucleotide over time. The data allow
optimization of specific hybridization conditions at, e.g., room
temperature (for simplified diagnostic applications).

Another way of verifying altered duplex stability is by 
following the signal intensity generated upon hybridization 
with time. Previous experiments using DNA targets and 
DNA chips have shown that signal intensity increases with 
time, and that the more stable duplexes generate higher 
signal intensities faster than less stable duplexes. The signals 
reach a plateau or "saturate" after a certain amount of time 
due to all of the binding sites becoming occupied. These data 
allow for optimization of hybridization, and determination 
of equilibration conditions at a specified temperature. 

Graphs of signal intensity and base mismatch positions 
are plotted and the ratios of perfect match versus mis- 
matches calculated. This calculation shows the sequence 
specific properties of nucleotide analogues as probes. Per- 
fect match/mismatch ratios greater than 4 are often desirable 
in an oligonucleotide diagnostic assay because, for a diploid 
genome, ratios of 2 have to be distinguished (e.g., in the case 
of a heterozygous trait or sequence). 
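
The ratio calculation described above can be sketched as follows. The function names, probe labels, and intensity values are hypothetical and chosen only for illustration (they are not measured data); the threshold of 4 is the figure given in the text.

```python
def discrimination_ratios(perfect_match, mismatches):
    """Return the perfect-match/mismatch signal ratio for each mismatch probe.

    perfect_match: signal intensity of the perfect match probe.
    mismatches: dict mapping a mismatch probe label to its signal intensity.
    """
    ratios = {}
    for label, signal in mismatches.items():
        if signal <= 0:
            raise ValueError(f"non-positive signal for probe {label!r}")
        ratios[label] = perfect_match / signal
    return ratios


def well_discriminated(ratios, threshold=4.0):
    """Flag the mismatch probes whose ratio exceeds the threshold."""
    return {label: ratio > threshold for label, ratio in ratios.items()}


# Hypothetical fluorescence intensities in arbitrary units.
ratios = discrimination_ratios(1200.0, {"mismatch_pos3": 150.0,
                                        "mismatch_pos4": 400.0})
print(ratios)                      # {'mismatch_pos3': 8.0, 'mismatch_pos4': 3.0}
print(well_discriminated(ratios))  # {'mismatch_pos3': True, 'mismatch_pos4': False}
```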

Target Nucleic Acids Which Comprise Nucleotide Analogues

Modified nucleotides and nucleotide analogues are incor- 
porated synthetically or enzymatically into DNA or RNA 
target nucleic acids for hybridization analysis to oligonucle- 
otide arrays. The incorporation of nucleotide analogues in 
the target optimizes the hybridization of the target in terms 
of sequence specificity and/or the overall affinity of binding 
to oligonucleotide and oligonucleotide analogue probe 
arrays. The use of nucleotide analogues in either the
oligonucleotide array or the target nucleic acid, or both, makes
hybridization interactions easier to optimize. Examples of
useful nucleotide analogues which are substituted for
naturally occurring nucleotides include 7-deazaguanosine,
2,6-diaminopurine nucleotides, 5-propynyl and other
5-substituted pyrimidine nucleotides, 2'-fluoro and
2'-methoxy-2'-deoxynucleotides and the like.

These nucleotide analogues are incorporated into nucleic 
acids using the synthetic methods described supra, or using 
DNA or RNA polymerases. The nucleotide analogues are 
preferably incorporated into target nucleic acids using in 
vitro amplification methods such as PCR, LCR,
Qβ-replicase expansion, in vitro transcription (e.g., nick
translation or random-primer transcription) and the like. 
Alternatively, the nucleotide analogues are optionally incor- 
porated into cloned nucleic acids by culturing a cell which 
comprises the cloned nucleic acid in media which includes 
a nucleotide analogue. 

Similar to the use of nucleotide analogues in probe arrays, 
7-deazaguanosine is used in target nucleic acids to substitute 
for G/dG to enhance target hybridization by reducing sec- 
ondary structure in sequences containing runs of poly-G/dG. 
2,6-diaminopurine nucleotides substitute for A/dA to enhance
target hybridization through enhanced H-bonding to T or U
rich probes. 5-propynyl and other 5-substituted pyrimidine
nucleotides substitute for natural pyrimidines to enhance
target hybridization to certain purine rich probes. 2'-fluoro
and 2'-methoxy-2'-deoxynucleotides substitute for natural
nucleotides to enhance target hybridization to similarly
substituted probe sequences.

Synthesis of 5'-photoprotected 2'-O alkyl ribonucleotide
analogues

The light-directed synthesis of complex arrays of nucleotide
analogues on a glass surface is achieved by derivatizing
cyanoethyl phosphoramidite nucleotides and nucleotide
analogues (e.g., nucleoside analogues of uridine, thymidine,
cytidine, adenosine and guanosine, with phosphates) with,
for example, the photolabile MeNPoc group in the
5'-hydroxyl position instead of the usual dimethoxytrityl
group. See, application SN PCT/US94/12305.

Specific base-protected 2'-O alkyl nucleosides are
commercially available from, e.g., Chem Genes Corp. (MA).
The photolabile MeNPoc group is added to the 5'-hydroxyl
position followed by phosphitylation to yield cyanoethyl
phosphoramidite monomers. Commercially available
nucleosides are optionally modified (e.g., by 2'-O-alkylation)
to create nucleoside analogues which are used to generate
oligonucleotide analogues.

Modifications to the above procedures are used in some
embodiments to avoid significant addition of MeNPoc to the
3'-hydroxyl position. For instance, in one embodiment, a
2'-O-methyl ribonucleotide analogue is reacted with DMT-Cl
{di(p-methoxyphenyl)phenylchloride} in the presence of
pyridine to generate a 2'-O-methyl-5'-O-DMT ribonucleotide
analogue. This allows for the addition of TBDMS to
the 3'-O of the ribonucleoside analogue by reaction with
TBDMS-Triflate (t-butyldimethylsilyltrifluoromethanesulfonate)
in the presence of triethylamine in THF
(tetrahydrofuran) to yield a 2'-O-methyl-3'-O-TBDMS-5'-O-DMT
ribonucleotide base analogue. This analogue is treated
with TCAA (trichloroacetic acid) to cleave off the DMT
group, leaving a reactive hydroxyl group at the 5' position.
MeNPoc is then added to the oxygen of the 5' hydroxyl
group using MeNPoc-Cl in the presence of pyridine. The
TBDMS group is then cleaved with F⁻ (e.g., NaF) to yield
a ribonucleotide base analogue with a MeNPoc group
attached to the 5' oxygen on the nucleotide analogue. If
appropriate, this analogue is phosphitylated to yield a
phosphoramidite for oligonucleotide analogue synthesis. Other
nucleosides or nucleoside analogues are protected by similar
procedures.

Synthesis of Oligonucleotide Analogue Arrays on Chips

Other than the use of photoremovable protecting groups,
the nucleoside coupling chemistry used in VLSIPS™
technology for synthesizing oligonucleotides and oligonucleotide
analogues on chips is similar to that used for
oligonucleotide synthesis. The oligonucleotide is typically linked
to the substrate via the 3'-hydroxyl group of the
oligonucleotide and a functional group on the substrate which results
in the formation of an ether, ester, carbamate or phosphate
ester linkage. Nucleotide or oligonucleotide analogues are
attached to the solid support via carbon-carbon bonds using,
for example, supports having (poly)trifluorochloroethylene
surfaces, or preferably, by siloxane bonds (using, for
example, glass or silicon oxide as the solid support).
Siloxane bonds with the surface of the support are formed in one
embodiment via reactions of surface attaching portions
bearing trichlorosilyl or trialkoxysilyl groups. The surface
attaching groups have a site for attachment of the
oligonucleotide analogue portion. For example, groups which are
suitable for attachment include amines, hydroxyl, thiol, and
carboxyl. Preferred surface attaching or derivatizing portions
include aminoalkylsilanes and hydroxyalkylsilanes. In
particularly preferred embodiments, the surface attaching
portion of the oligonucleotide analogue is either
bis(2-hydroxyethyl)-aminopropyltriethoxysilane,
N-(3-triethoxysilylpropyl)-4-hydroxybutylamide,
[illegible] or hydroxypropyltriethoxysilane.

Oligoribonucleotides generated by synthesis using
ordinary ribonucleotides are usually base labile due to the
presence of the 2'-hydroxyl group.
2'-O-methyloligoribonucleotides (2'-OMeORNs), analogues of
RNA where the 2'-hydroxyl group is methylated, are DNAse
and RNAse resistant, making them less base labile. Sproat,
B. S., and Lamond, A. I., in Oligonucleotides and Analogues:
A Practical Approach, edited by F. Eckstein, New York: IRL
Press at Oxford University Press, 1991, pp. 49-86,
incorporated herein by reference for all purposes, have reported
the synthesis of mixed sequences of 2'-O-Methoxy-
oligoribonucleotides (2'-O-MeORNs) using dimethoxytrityl
phosphoramidite chemistry. These 2'-O-MeORNs display
greater binding affinity for complementary nucleic acids
than their unmodified counterparts.

Other embodiments of the invention provide mechanical
means to generate oligonucleotide analogues. These
techniques are discussed in co-pending application Ser. No.
07/796,243, filed Nov. 22, 1991, which is incorporated
herein by reference in its entirety for all purposes.
Essentially, oligonucleotide analogue reagents are directed
over the surface of a substrate such that a predefined array
of oligonucleotide analogues is created. For instance, a
series of channels, grooves, or spots are formed on or
adjacent to a substrate. Reagents are selectively flowed
through or deposited in the channels, grooves, or spots,
forming an array having different oligonucleotides and/or
oligonucleotide analogues at selected locations on the
substrate.

Detection of Hybridization

In one embodiment, hybridization is detected by labeling
a target with, e.g., fluorescein or other known visualization
agents and incubating the target with an array of
oligonucleotide analogue probes. Upon duplex formation by the target
with a probe in the array (or triplex formation in
embodiments where the array comprises unimolecular
double-stranded probes), the fluorescein label is excited by, e.g., an
argon laser and detected by viewing the array, e.g., through
a scanning confocal microscope.

Sequencing by hybridization

Current sequencing methodologies are highly reliant on
complex procedures and require substantial manual effort.
Conventional DNA sequencing technology is a laborious
procedure requiring electrophoretic size separation of
labeled DNA fragments. An alternative approach involves a
hybridization strategy carried out by attaching target DNA to
a surface. The target is interrogated with a set of
oligonucleotide probes, one at a time (see, application SN
PCT/US94/12305).

A preferred method of oligonucleotide probe array
synthesis involves the use of light to direct the synthesis of
oligonucleotide analogue probes in high-density,
miniaturized arrays. Matrices of spatially-defined oligonucleotide
analogue probe arrays were generated. The ability to use
these arrays to identify complementary sequences was
demonstrated by hybridizing fluorescent labeled
oligonucleotides to the matrices produced.

Oligonucleotide analogue arrays are used, e.g., to study
sequence specific hybridization of nucleic acids, or
protein-nucleic acid interactions. Oligonucleotide analogue arrays
are used to define the thermodynamic and kinetic rules
governing the formation and stability of oligonucleotide and
oligonucleotide analogue complexes.

Oligonucleotide analogue Probe Arrays and Libraries 

The use of oligonucleotide analogues in probe arrays 
provides several benefits as compared to standard oligo- 
nucleotide arrays. For instance, as discussed supra, certain 
oligonucleotide analogues have enhanced hybridization 
characteristics to complementary nucleic acids as compared 
with oligonucleotides made of naturally occurring nucle- 
otides. One primary benefit of enhanced hybridization char- 
acteristics is that oligonucleotide analogue probes are 
optionally shorter than corresponding probes which do not 
include nucleotide analogues. 

Standard oligonucleotide probe arrays typically require 
fairly long probes (about 15-25 nucleotides) to achieve 
strong binding to target nucleic acids. The use of such long 
probes is disadvantageous for two reasons. First, the longer 
the probe, the more synthetic steps must be performed to 
make the probe and any probe array comprising the probe. 
This increases the cost of making the probes and arrays. 
Furthermore, as each synthetic step results in less than 100% 
coupling for every nucleotide, the quality of the probes 
degrades as they become longer. Secondly, short probes
provide better mismatch discrimination for hybridization to
a target nucleic acid. This is because a single base mismatch
for a short probe-target hybridization is more destabilizing
than a single mismatch for a long probe-target hybridization.
Thus, it is harder to distinguish a single probe-target
mismatch when the probe is a 20-mer than when the probe is an
8-mer. Accordingly, the use of short oligonucleotide ana- 
logue probes reduces costs and increases mismatch discrimi- 
nation in probe arrays. 
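
The effect of incomplete coupling on probe quality can be illustrated with a short calculation. This sketch assumes a uniform stepwise coupling yield and one coupling per nucleotide; the 95% figure is a hypothetical value chosen for illustration and is not taken from the source.

```python
def full_length_fraction(step_yield, length):
    """Fraction of chains that reach full length after `length` couplings.

    Assumes every coupling succeeds independently with probability
    `step_yield` and that a failed coupling truncates the chain.
    """
    if not 0.0 <= step_yield <= 1.0:
        raise ValueError("step_yield must be between 0 and 1")
    return step_yield ** length


for n in (8, 20):
    print(n, round(full_length_fraction(0.95, n), 3))
# 8 0.663
# 20 0.358
```

Under these assumptions, roughly two thirds of 8-mer sites carry the full-length probe, against about one third for a 20-mer, which is the sense in which probe quality degrades with length.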

The enhanced hybridization characteristics of oligonucleotide
analogues also allow for the creation of oligonucleotide
analogue probe arrays where the probes in the arrays
have substantial secondary structure. For instance, the oli- 
gonucleotide analogue probes are optionally configured to 
be fully or partially double stranded on the array. The probes 
are optionally complexed with complementary nucleic 
acids, or are optionally unimolecular oligonucleotides with 
self-complementary regions. Libraries of diverse double- 
stranded oligonucleotide analogue probes are used, for 
example, in screening studies to determine binding affinity 
of nucleic acid binding proteins, drugs, or oligonucleotides 
(e.g., to examine triple helix formation). Specific oligonucle- 
otide analogues are known to be conducive to the formation 
of unusual secondary structure. See, Durland (1995)
Bioconjugate Chem. 6: 278-282. General strategies for using
unimolecular double-stranded oligonucleotides as probes
and for library generation are described in application Ser. No.
08/327,687, and similar strategies are applicable to
oligonucleotide analogue probes.

In general, a solid support, which optionally has an
attached spacer molecule, is attached to the distal end of the
oligonucleotide analogue probe. The probe is attached as a
single unit, or synthesized on the support or spacer in a
monomer by monomer approach using the VLSIPS™ or
mechanical partitioning methods described supra. Where the
oligonucleotide analogue arrays are fully double-stranded, 
oligonucleotides (or oligonucleotide analogues) comple- 
mentary to the probes on the array are hybridized to the 
array. 

In some embodiments, molecules other than
oligonucleotides, such as proteins, dyes, co-factors, linkers
and the like are incorporated into the oligonucleotide
analogue probe, or attached to the distal end of the oligomer,
e.g., as a spacing molecule, or as a probe or probe target.
Flexible linkers are optionally used to separate complemen- 
tary portions of the oligonucleotide analogue. 

The present invention also contemplates the preparation 
of libraries of oligonucleotide analogues having bulges or 
loops in addition to complementary regions. Specific RNA 
bulges are often recognized by proteins (e.g., TAR RNA is 
recognized by the TAT protein of HIV). Accordingly, librar- 
ies of oligonucleotide analogue bulges or loops are useful in 
a number of diagnostic applications. The bulge or loop can 
be present in the oligonucleotide analogue or linker portions. 

Unimolecular analogue probes can be configured in a
variety of ways. In one embodiment, the unimolecular
probes comprise linkers, for example, where the probe is
arranged according to the formula Y-L¹-X¹-L²-X², in
which Y represents a solid support, X¹ and X² represent a
pair of complementary oligonucleotides or oligonucleotide
analogues, L¹ represents a bond or a spacer, and L²
represents a linking group having sufficient length such that X¹
and X² form a double-stranded oligonucleotide. The general
synthetic and conformational strategy used in generating the
double-stranded unimolecular probes is similar to that
described in co-pending application Ser. No. 08/327,687,
except that any of the elements of the probe (L¹, X¹, L² and
X²) comprises a nucleotide or an oligonucleotide analogue.
For instance, in one embodiment X¹ is an oligonucleotide
analogue.
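
The Y-L¹-X¹-L²-X² arrangement can be modelled with a small data structure. This is a minimal sketch under simplifying assumptions introduced here: the class and field names are hypothetical, X¹ and X² are represented as plain DNA strings of standard bases rather than analogues, and "able to fold" is reduced to X² being the exact reverse complement of X¹.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


@dataclass
class UnimolecularProbe:
    support: str  # Y: the solid support
    spacer: str   # L1: a spacer, or "" for a direct bond
    x1: str       # X1: first strand, written 5' to 3'
    linker: str   # L2: linking group between the two strands
    x2: str       # X2: second strand, written 5' to 3'

    def can_fold(self):
        """True if X2 can pair with X1 along its full length."""
        return self.x2.upper() == reverse_complement(self.x1)

    def layout(self):
        """Render the probe in Y-L1-X1-L2-X2 order."""
        spacer = self.spacer or "bond"
        return f"{self.support}-[{spacer}]-{self.x1}-[{self.linker}]-{self.x2}"


probe = UnimolecularProbe("glass", "", "GATTACA", "PEG", "TGTAATC")
print(probe.layout())    # glass-[bond]-GATTACA-[PEG]-TGTAATC
print(probe.can_fold())  # True
```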

The oligonucleotide analogue probes are optionally 
arranged to present a variety of moieties. For example, 
structural components are optionally presented from the 
middle of a conformationally restricted oligonucleotide
analogue probe. In these embodiments, the analogue probes
generally have the structure -X¹¹-Z-X¹², wherein X¹¹ and
X¹² are complementary oligonucleotide analogues and Z is
a structural element presented away from the surface of the 
probe array. Z can include an agonist or antagonist for a cell 
membrane receptor, a toxin, venom, viral epitope, hormone, 
peptide, enzyme, cofactor, drug, protein, antibody or the 
like. 

General tiling strategies for detection of a Polymorphism 
in a target oligonucleotide 

In diagnostic applications, oligonucleotide analogue 
arrays (e.g., arrays on chips, slides or beads) are used to 
determine whether there are any differences between a 
reference sequence and a target oligonucleotide, e.g., 
whether an individual has a mutation or polymorphism in a 
known gene. As discussed supra, the oligonucleotide target 
is optionally a nucleic acid such as a PCR amplicon which 
comprises one or more nucleotide analogues. In one 
embodiment, arrays are designed to contain probes
exhibiting complementarity to one or more selected reference
sequences whose sequence is known. The arrays are used to
read a target sequence comprising either the reference 
sequence itself or variants of that sequence. Any polynucle- 
otide of known sequence is selected as a reference sequence. 
Reference sequences of interest include sequences known to 
include mutations or polymorphisms associated with phe- 
notypic changes having clinical significance in human 
patients. For example, the CFTR gene and p53 gene in
humans have been identified as the location of several
mutations resulting in cystic fibrosis or cancer, respectively.
Other reference sequences of interest include those that 
serve to identify pathogenic microorganisms and/or are the
site of mutations by which such microorganisms acquire
drug resistance (e.g., the HIV reverse transcriptase gene for
HIV resistance). Other reference sequences of interest
include regions where polymorphic variations are known to
occur (e.g., the D-loop region of mitochondrial DNA).
These reference sequences also have utility for, e.g.,
forensic, cladistic, or epidemiological studies.

Other reference sequences of interest include those from
the genome of pathogenic viruses (e.g., hepatitis (A, B, or
C), herpes virus (e.g., VZV, HSV-1, HAV-6, HSV-II, CMV,
and Epstein Barr virus), adenovirus, influenza virus,
flaviviruses, echovirus, rhinovirus, coxsackie virus,
coronavirus, respiratory syncytial virus, mumps virus,
rotavirus, measles virus, rubella virus, parvovirus, vaccinia
virus, HTLV virus, dengue virus, papillomavirus, molluscum
virus, poliovirus, rabies virus, JC virus and arboviral
encephalitis virus). Other reference sequences of interest are
from genomes or episomes of pathogenic bacteria,
particularly regions that confer drug resistance or allow phylogenic
characterization of the host (e.g., 16S rRNA or corresponding
DNA). For example, such bacteria include chlamydia,
rickettsial bacteria, mycobacteria, staphylococci, streptococci,
pneumococci, meningococci and gonococci, klebsiella,
proteus, serratia, pseudomonas, legionella, diphtheria,
salmonella, bacilli, cholera, tetanus, botulism, anthrax,
plague, leptospirosis, and Lyme disease bacteria. Other
reference sequences of interest include those in which
mutations result in the following autosomal recessive
disorders: sickle cell anemia, β-thalassemia, phenylketonuria,
galactosemia, Wilson's disease, hemochromatosis, severe
combined immunodeficiency, alpha-1-antitrypsin
deficiency, albinism, alkaptonuria, lysosomal storage
diseases and Ehlers-Danlos syndrome. Other reference
sequences of interest include those in which mutations result
in X-linked recessive disorders: hemophilia,
glucose-6-phosphate dehydrogenase deficiency, agammaglobulinemia,
diabetes insipidus, Lesch-Nyhan syndrome, muscular dystrophy,
Wiskott-Aldrich syndrome, Fabry's disease and fragile
X-syndrome. Other reference sequences of interest include
those in which mutations result in the following autosomal
dominant disorders: familial hypercholesterolemia,
polycystic kidney disease, Huntington's disease, hereditary
spherocytosis, Marfan's syndrome, von Willebrand's
disease, neurofibromatosis, tuberous sclerosis, hereditary
hemorrhagic telangiectasia, familial colonic polyposis,
Ehlers-Danlos syndrome, myotonic dystrophy, muscular
dystrophy, osteogenesis imperfecta, acute intermittent
porphyria, and von Hippel-Lindau disease.

Although an array of oligonucleotide analogue probes is 
usually laid down in rows and columns for simplified data 
processing, such a physical arrangement of probes on the 
solid substrate is not essential. Provided that the spatial 
location of each probe in an array is known, the data from 
the probes are collected and processed to yield the sequence 
of a target irrespective of the physical arrangement of the 
probes on, e.g., a chip. In processing the data, the 
hybridization signals from the respective probes are assembled 
into any conceptual array desired for subsequent data reduction, 
whatever the physical arrangement of probes on the 
substrate. 
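
The point about physical versus conceptual layout can be illustrated with a minimal Python sketch. The layout table, the signal values and the function name below are hypothetical and are not taken from the specification; the sketch only assumes that the location of every probe is recorded.

```python
# Hypothetical layout: physical (row, col) on the chip -> index of the
# reference position that the probe at that spot interrogates.
layout = {(0, 0): 2, (0, 1): 0, (1, 0): 3, (1, 1): 1}

# Hypothetical hybridization signals read at each physical location.
signals = {(0, 0): 0.42, (0, 1): 0.91, (1, 0): 0.15, (1, 1): 0.77}


def assemble(layout, signals):
    """Reorder signals by conceptual index, ignoring physical position."""
    ordered_locations = sorted(layout, key=layout.get)
    return [signals[loc] for loc in ordered_locations]


conceptual = assemble(layout, signals)
```

Because the reordering depends only on the recorded locations, the same data reduction would apply to any physical arrangement of the probes.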

In one embodiment, a basic tiling strategy provides an 
array of immobilized probes for analysis of a target 
oligonucleotide showing a high degree of sequence similarity to 
one or more selected reference oligonucleotides (e.g., 
detection of a point mutation in a target sequence). For instance, 
a first probe set comprises a plurality of probes exhibiting 
perfect complementarity with a selected reference 
oligonucleotide. The perfect complementarity usually exists 
throughout the length of the probe. However, probes having 
a segment or segments of perfect complementarity that is/are 
flanked by leading or trailing sequences lacking 
complementarity to the reference sequence can also be used. Within 
a segment of complementarity, each probe in the first probe 
set has at least one interrogation position that corresponds to 
a nucleotide in the reference sequence. The interrogation 
position is aligned with the corresponding nucleotide in the 
reference sequence when the probe and reference sequence 
are aligned to maximize complementarity between the two. 
If a probe has more than one interrogation position, each 
corresponds with a respective nucleotide in the reference 
sequence. The identity of an interrogation position and 
corresponding nucleotide in a particular probe in the first 
probe set cannot be determined simply by inspection of the 
probe in the first set. An interrogation position and 
corresponding nucleotide are defined by the comparative structures 
of probes in the first probe set and corresponding probes 
from additional probe sets. 

For each probe in the first set, there are, for purposes of 
the present illustration, multiple corresponding probes from 
additional probe sets. For instance, there are optionally 
probes corresponding to each nucleotide of interest in the 
reference sequence. Each of the corresponding probes has an 
interrogation position aligned with that nucleotide of 
interest. Usually, the probes from the additional probe sets are 
identical to the corresponding probe from the first probe set 
with one exception. The exception is at the interrogation 
position, which occurs in the same position in each of the 
corresponding probes from the additional probe sets; this 
position is occupied by a different nucleotide in each of the 
corresponding probe sets. Other tiling strategies are also 
employed, depending on the information to be obtained. 
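
The basic tiling strategy described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of the specification: the probe length of 7, the central interrogation position, the use of three additional probe sets (one per alternative base) and all function names are hypothetical choices made for the example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Position-wise complement; the probe is read antiparallel to the reference."""
    return "".join(COMPLEMENT[base] for base in seq)


def basic_tiling(reference, probe_len=7):
    """Return (interrogated position, perfect probe, variant probes) per window."""
    mid = probe_len // 2  # interrogation position, assumed to be central
    tiles = []
    for start in range(len(reference) - probe_len + 1):
        window = reference[start:start + probe_len]
        perfect = complement(window)  # member of the first probe set
        # Corresponding probes from the additional sets differ only at `mid`.
        variants = [perfect[:mid] + base + perfect[mid + 1:]
                    for base in "ACGT" if base != perfect[mid]]
        tiles.append((start + mid, perfect, variants))
    return tiles


# The first 12 bases of SEQ ID NO:6 are used here purely as sample input.
tiles = basic_tiling("CTGAACGGTAGC")
```

Each tile pairs one perfectly complementary probe with three probes that differ from it only at the interrogation position, which is the comparison that identifies the interrogated nucleotide.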

The probes are oligonucleotide analogues which are 
capable of hybridizing with a target nucleic acid sequence by 
complementary base-pairing. Complementary base pairing 
includes sequence-specific base pairing, which comprises, 
e.g., Watson-Crick base pairing or other forms of base 
pairing such as Hoogsteen base pairing. The probes are 
attached by any appropriate linkage to a support. 3' attach- 
ment is more usual as this orientation is compatible with the 
preferred chemistry used in solid phase synthesis of oligo- 
nucleotides and oligonucleotide analogues (with the excep- 
tion of, e.g., analogues which do not have a phosphate 
backbone, such as peptide nucleic acids). 

EXAMPLES 

The following examples are provided by way of illustra- 
tion only and not by way of limitation. A variety of param- 
eters can be changed or modified to yield essentially similar 
results. 

One approach to enhancing oligonucleotide hybridization 
is to increase the thermal stability (Tm) of the duplex formed 
between the target and the probe using oligonucleotide 
analogues that are known to increase Tm's upon 
hybridization to DNA. Enhanced hybridization using oligonucleotide 
analogues is described in the examples below, including 
enhanced hybridization in oligonucleotide arrays. 

Example 1 

Solution oligonucleotide melting Tm 

The Tm of 2'-O-methyl oligonucleotide analogues was 
compared to the Tm for the corresponding DNA and RNA 
sequences in solution. In addition, the Tm of 2'-O-methyl 
oligonucleotide:DNA, 2'-O-methyl oligonucleotide:RNA 
and RNA:DNA duplexes in solution was also determined. 
The Tm was determined by varying the sample temperature 
and monitoring the absorbance of the sample solution at 260 
nm. The oligonucleotide samples were dissolved in a 0.1 M 
NaCl solution with an oligonucleotide concentration of 2 
μM. Table 1 summarizes the results of the experiment. The 
results show that the hybridization of DNA in solution has 
approximately the same Tm as the hybridization of DNA 
with a 2'-O-methyl-substituted oligonucleotide analogue. 
The results also show that the Tm for the 2'-O-methyl-substituted 
oligonucleotide duplex is higher than that for the 
corresponding RNA:2'-O-methyl-substituted oligonucleotide 
duplex, which is higher than the Tm for the 
corresponding DNA:DNA or RNA:DNA duplex. 
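
The specification does not state how Tm was extracted from the absorbance readings. One common convention, assumed here, takes Tm as the temperature at which the absorbance at 260 nm rises most steeply. The Python sketch below applies that convention to a synthetic curve; the curve shape, its parameters and the function name are hypothetical.

```python
import math


def melting_temperature(temps, absorbances):
    """Return the midpoint of the interval with the steepest absorbance rise."""
    best_slope, best_tm = float("-inf"), None
    for i in range(len(temps) - 1):
        slope = (absorbances[i + 1] - absorbances[i]) / (temps[i + 1] - temps[i])
        if slope > best_slope:
            best_slope = slope
            best_tm = (temps[i] + temps[i + 1]) / 2
    return best_tm


# Synthetic sigmoidal melt curve centred at 61.6 degrees C (the DNA:DNA
# value in Table 1); it is not measured data.
temps = [t / 2 for t in range(60, 181)]  # 30.0 to 90.0 in 0.5 degree steps
a260 = [0.50 + 0.15 / (1 + math.exp(-(t - 61.6) / 2.0)) for t in temps]
tm_estimate = melting_temperature(temps, a260)
```

With a 0.5 degree sampling step the estimate can only be resolved to the nearest interval midpoint, so finer temperature steps or curve fitting would be needed for values quoted to 0.1 degree.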

TABLE 1 

Solution Oligonucleotide Melting Experiments 
(+) = Target Sequence (5'-CTGAACGGTAGCATCTTGAC-3') (SEQ ID NO: 6)* 
(-) = Complementary Sequence 

Type of Oligonucleotide,    Type of Oligonucleotide, 
Target Sequence (+)         Complementary Sequence (-)    Tm (° C.) 

DNA(+)                      DNA(-)                        61.6 
DNA(+)                      2'OMe(-)                      58.6 
2'OMe(+)                    DNA(-)                        61.6 
2'OMe(+)                    2'OMe(-)                      78.0 
RNA(+)                      DNA(-)                        58.2 
RNA(+)                      2'OMe(-)                      73.6 

*T refers to thymine for the DNA oligonucleotides, or uracil for the RNA 
oligonucleotides. 

Example 2 

Array hybridization experiments with DNA chips and 
oligonucleotide analogue targets 

A variable length DNA probe array on a chip was 
designed to discriminate single base mismatches in the 3 
corresponding sequences 
5'-CTGAACGGTAGCATCTTGAC-3' (SEQ ID NO:6) 
(DNA target), 5'-CUGAACGGUAGCAUCUUGAC-3' 
(SEQ ID NO:8) (RNA target) and 
5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO:9) 
(2'-O-methyl oligonucleotide target), and generated by the 
VLSIPS™ procedure. The chip was designed with adjacent 
12-mers and 8-mers which overlapped with the 3 target 
sequences as shown in Table 2. 

TABLE 2 

Array hybridization Experiments 

Target 1 (DNA)               5'-CTGAACGGTAGCATCTTGAC-3' (SEQ ID NO: 6) 
  8-mer probe (complement) 
  12-mer probe (complement) 
Target 2 (RNA)               5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO: 8) 
  8-mer probe (complement) 
  12-mer probe (complement) 
Target 3 (2'-O-Me oligo)     5'-CUGAACGGUAGCAUCUUGAC-3' (SEQ ID NO: 9) 
  8-mer probe (complement) 
  12-mer probe (complement) 

Target oligos were synthesized using standard techniques. 
The DNA and 2'-O-methyl oligonucleotide analogue target 
oligonucleotides were hybridized to the chip at a 
concentration of 10 nM in 5x SSPE at 20° C. in sequential 
experiments. Intensity measurements were taken at each 
probe position in the 8-mer and 12-mer arrays over time. The 
rate of increase in intensity was then plotted for each probe 
position. The rate of increase in intensity was similar for 
both targets in the 8-mer probe arrays, but the 12-mer probes 
hybridized more rapidly to the DNA target oligonucleotide. 

Plots of intensity versus probe position were generated for 
the RNA, DNA and 2'-O-methyl oligonucleotides to 
ascertain mismatch discrimination. The 8-mer probes displayed 
similar mismatch discrimination against all targets. The 
12-mer probes displayed the highest mismatch 
discrimination for the DNA targets, followed by the 2'-O-methyl target, 
with the RNA target showing the poorest mismatch 
discrimination. 

Thermal equilibrium experiments were performed by 
hybridizing each of the targets to the chip for 90 minutes at 
5° C. temperature intervals. The chip was hybridized with 
the target in 5x SSPE at a target concentration of 10 nM. 
Intensity measurements were taken at the end of the 90 
minute hybridization at each temperature point as described 
above. All of the targets displayed similar stability, with 
minimal hybridization to the 8-mer probes at 30° C. In 
addition, all of the targets showed similar stability in 
hybridizing to the 12-mer probes. Thus, the 2'-O-methyl 
oligonucleotide target had similar hybridization characteristics to 
DNA and RNA targets when hybridized against DNA 
probes. 

Example 3 

2'-O-methyl-substituted oligonucleotide chips 

DMT-protected DNA and 2'-O-methyl phosphoramidites 
were used to synthesize 8-mer probe arrays on a glass slide 
using the VLSIPS™ method. The resulting chip was 
hybridized to DNA and RNA targets in separate experiments. The 
target sequence, the sequences of the probes on the chip and 
the general physical layout of the chip are described in Table 
3. 

The chip was hybridized to the RNA and DNA targets in 
successive experiments. The hybridization conditions used 
were 10 nM target, in 5x SSPE. The chip and solution were 
heated from 20° C. to 50° C., with a fluorescence 
measurement taken at 5 degree intervals as described in SN PCT/ 
US94/12305. The chip and solution were maintained at each 
temperature for 90 minutes prior to fluorescence 
measurements. The results of the experiment showed that DNA 
probes were equal or superior to 2'-O-methyl 
oligonucleotide analogue probes for hybridization to a DNA target, but 
that the 2'-O-methyl analogue oligonucleotide probes 
showed dramatically better hybridization to the RNA target 
than the DNA probes. In addition, the 2'-O-methyl analogue 
oligonucleotide probes showed superior mismatch 
discrimination of the RNA target compared to the DNA probes. The 
difference in fluorescence intensity between the matched and 
mismatched analogue probes was greater than the difference 
between the matched and mismatched DNA probes, 
dramatically increasing the signal-to-noise ratio. FIG. 1 
displays the results graphically (FIGS. 1A and 1B). (M) and (P) 
indicate mismatched and perfectly matched probes, 
respectively. FIGS. 1C and 1D illustrate the fluorescence 
intensity versus location on an example chip for the various 
probes at 20° C. using DNA and RNA targets, respectively. 

TABLE 3 

2'-O-methyl Oligonucleotide Analogues on a Chip. 

Target Sequence (DNA):              5'-CTGAACGGTAGCATCTTGAC-3' 
                                    (SEQ ID NO: 6) 
Target Sequence (RNA):              5'-CUGAACGGUAGCAUCUUGAC-3' 
                                    (SEQ ID NO: 8) 
Matching DNA oligonucleotide        5'-CTTGCCAT (SEQ ID NO: 10) 
probe {DNA (M)} 
Matching 2'-O-methyl                5'-CUUGCCAU (SEQ ID NO: 11) 
oligonucleotide analogue probe 
{2'OMe (M)} 
DNA oligonucleotide probe with      5'-CTTGCTAT (SEQ ID NO: 12) 
1 base mismatch {DNA (P)} 
2'-O-methyl oligonucleotide         5'-CUUGCUAU (SEQ ID NO: 13) 
analogue probe with 1 base 
mismatch {2'OMe (M)} 

SCHEMATIC REPRESENTATION OF 2'-O-METHYL/DNA CHIP 

Matching 2'-O-methyl oligonucleotide analogue probe 
2'-O-methyl oligonucleotide analogue probe with 1 base mismatch 
DNA oligonucleotide probe with 1 base mismatch 
Matching DNA oligonucleotide probe 

Example 4 

Synthesis of oligonucleotide analogues 

The reagent MeNPoc-Cl reacts non-selectively 
with both the 5' and 3' hydroxyls on 2'-O-methyl nucleoside 
analogues. Thus, to generate high yields of 5'-O-MeNPoc-2'-O-methylribonucleoside 
analogues for use in 
oligonucleotide analogue synthesis, the following 
protection-deprotection scheme was utilized. 

The protective group DMT was added to the 5'-O position 
of the 2'-O-methylribonucleoside analogue in the presence 
of pyridine. The resulting 5'-O-DMT protected analogue 
was reacted with TBDMS-Triflate in THF, resulting in the 
addition of the TBDMS group to the 3'-O of the analogue. 
The 5'-DMT group was then removed with TCAA to yield 
a free OH group at the 5' position of the 2'-O-methyl 
ribonucleoside analogue, followed by the addition of 
MeNPoc-Cl in the presence of pyridine, to yield 
5'-O-MeNPoc-3'-O-TBDMS-2'-O-methyl ribonucleoside 
analogue. The TBDMS group was then removed by reaction 
with NaF, and the 3'-OH group was phosphitylated using 
standard techniques. 

Two other potential strategies did not result in high 
specific yields of 5'-O-MeNPoc-2'-O-methylribonucleoside. 
In the first, a less reactive MeNPoc derivative was 
synthesized by reacting MeNPoc-Cl with N-hydroxysuccinimide to 
yield MeNPoc-NHS. This less reactive photocleavable 
group (MeNPoc-NHS) was found to react exclusively with 
the 3' hydroxyl on the 2'-O-methylribonucleoside analogue. 
In the second strategy, an organotin protection scheme was 
used. Dibutyltin oxide was reacted with the 
2'-O-methylribonucleoside analogue followed by reaction with 
MeNPoc. Both 5'-O-MeNPoc and 3'-O-MeNPoc 
2'-O-methylribonucleoside analogues were obtained. 

Example 5 

Hybridization to mixed-sequence oligodeoxynucleotide 
probes substituted with 2-amino-2'-deoxyadenosine (D) 

To test the effect of a 2-amino-2'-deoxyadenosine (D) 
substitution in a heterogeneous probe sequence, two 4x4 
oligodeoxynucleotide arrays were constructed using 
VLSIPS™ methodology and 5'-O-MeNPOC-protected 
deoxynucleoside phosphoramidites. Each array was 
comprised of the following set of probes based on the sequence 
(3')-CATCGTAGAA-(5') (SEQ ID NO:1): 

1. -(HEG)-(3')-CATN1GTAGAA-(5') (SEQ ID NO:14) 

2. -(HEG)-(3')-CATCN2TAGAA-(5') (SEQ ID NO:15) 

3. -(HEG)-(3')-CATCGN3AGAA-(5') (SEQ ID NO:16) 

4. -(HEG)-(3')-CATCGTN4GAA-(5') (SEQ ID NO:17) 
where HEG=hexaethyleneglycol linker, and N is either 
A, G, C or T, so that probes are obtained which contain single 
mismatches introduced at each of four central locations in 
the sequence. The first probe array was constructed with all 
natural bases. In the second array, 
2-amino-2'-deoxyadenosine (D) was used in place of adenosine (A). 
Both arrays were hybridized with a 5'-fluorescein-labeled 
oligodeoxynucleotide target, 
(5')-Fl-d(CTGAACGGTAGCATCTTGAC)-(3') (SEQ ID NO:18), 
which contained a sequence (in bold) complementary to the 
base probe sequence. The hybridization conditions were: 10 
nM target in 5x SSPE buffer at 22° C. with agitation. After 
30 minutes, the chip was mounted on the flowcell of a 
scanning laser confocal fluorescence microscope, rinsed 
briefly with 5x SSPE buffer at 22° C., and then a surface 
fluorescence image was obtained. 
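
The probe set of the first (all natural base) array can be enumerated with a short Python sketch. The row and column assignment below is a hypothetical layout, since the physical arrangement of the 4x4 array is not given here; only the base probe, the four varied positions and the target sequence come from the text. The D-substituted second array is not modelled.

```python
BASE = "CATCGTAGAA"              # base probe, written 3'->5' as in the text
TARGET = "CTGAACGGTAGCATCTTGAC"  # SEQ ID NO:18, written 5'->3'
PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}


def probe_grid(base, positions=(3, 4, 5, 6), bases="ACGT"):
    """Return {(row, col): probe}: one row per varied position, one column per N."""
    return {(row, col): base[:pos] + n + base[pos + 1:]
            for row, pos in enumerate(positions)
            for col, n in enumerate(bases)}


grid = probe_grid(BASE)

# Cells whose probe equals the base sequence are the perfect matches.
perfect = [loc for loc, probe in grid.items() if probe == BASE]

# A probe written 3'->5' pairs position by position with a target written
# 5'->3', so the position-wise complement should occur inside the target.
site = "".join(PAIR[b] for b in BASE)
```

Each row therefore holds one perfect-match probe and three single-base mismatches at the same position, which is the comparison used to judge specificity.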

The relative efficiency of hybridization of the target to the 
complementary and single-base mismatched probes was 
determined by comparing the average bound surface 
fluorescence intensity in those regions of the array 
containing the individual probe sequences. The results (FIG. 
3) show that a 2-amino-2'-deoxyadenosine (D) substitution 
in a heterogeneous probe sequence is a relatively neutral 
one, with little effect on either the signal intensity or the 
specificity of DNA-DNA hybridization, under conditions 
where the target is in excess and the probes are saturated. 

Example 6 

Hybridization to a dA-homopolymer oligodeoxynucle- 
otide probe substituted with 2-amino-2'-deoxyadenosine (D) 

The following experiment was performed to compare the 
hybridization of 2'-deoxyadenosine containing 
homopolymer arrays with 2-amino-2'-deoxyadenosine homopolymer 
arrays. The experiment was performed on two 11-mer 
oligodeoxynucleotide probe containing arrays. Two 11-mer 
oligodeoxynucleotide probe sequences were synthesized on 
a chip using 5'-O-MeNPOC-protected nucleoside 
phosphoramidites and standard VLSIPS™ methodology. 

The sequence of the first probe was: 
(HEG)-(3')-d(AAAAANAAAAA)-(5') (SEQ ID NO:19); where 
HEG=hexaethyleneglycol linker, and N is either A, G, C or T. The 
second probe was the same, except that dA was replaced by 
2-amino-2'-deoxyadenosine (D). The chip was hybridized 
with a 5'-fluorescein-labeled oligodeoxynucleotide target, 
(5')-Fl-d(TTTTTGTTTTT)-(3') (SEQ ID NO:20), which 
contained a sequence complementary to the probe sequences 
where N=C. Hybridization conditions were 10 nM target in 
5x SSPE buffer at 22° C. with agitation. After 15 minutes, 
the chip was mounted on the flowcell of a scanning laser 
confocal fluorescence microscope, rinsed briefly with 5x 
SSPE buffer at 22° C. (low stringency), and a surface 
fluorescence image was obtained. Hybridization to the chip 
was continued for another 5 hours, and a surface 
fluorescence image was acquired again. Finally, the chip was 
washed briefly with 0.5x SSPE (high stringency), then with 
5x SSPE, and re-scanned. 

The relative efficiency of hybridization of the target to the 
complementary and single-base mismatched probes was 
determined by comparing the average bound surface 
fluorescence intensity in those regions of the array 
containing the individual probe sequences. The results (FIG. 
4) indicate that substituting 2'-deoxyadenosine with 
2-amino-2'-deoxyadenosine in a d(A)n homopolymer probe 
sequence results in a significant enhancement in specific 
hybridization to a complementary oligodeoxynucleotide 
sequence. 

Example 7 

Hybridization to alternating A-T oligodeoxynucleotide 
probes substituted with 5-propynyl-2'-deoxyuridine (P) and 
2-amino-2'-deoxyadenosine (D) 

Commercially available 5'-DMT-protected 
2'-deoxynucleoside/nucleoside-analog phosphoramidites 
(Glen Research) were used to synthesize two decanucleotide 
probe sequences on separate areas on a chip using a modified 
VLSIPS™ procedure. In this procedure, a glass substrate is 
initially modified with a terminal-MeNPOC-protected 
hexaethyleneglycol linker. The substrate was exposed to light 
through a mask to remove the protecting group from the 
linker in a checkerboard pattern. The first probe sequence 
was then synthesized in the exposed region using 
DMT-phosphoramidites with acid-deprotection cycles, and the 
sequence was finally capped with (MeO)2PNiPr2/tetrazole 
followed by oxidation. A second checkerboard exposure in 
a different (previously unexposed) region of the chip was 
then performed, and the second probe sequence was 
synthesized by the same procedure. The sequence of the first 
"control" probe was: -(HEG)-(3')-CGCGCCGCGC-(5') 
(SEQ ID NO:21); and the sequence of the second probe was 
one of the following: 

1. -(HEG)-(3')-d(ATATAATATA)-(5') (SEQ ID NO:22) 

2. -(HEG)-(3')-d(APAPAAPAPA)-(5') (SEQ ID NO:23) 

3. -(HEG)-(3')-d(DTDTDDTDTD)-(5') (SEQ ID NO:24) 

4. -(HEG)-(3')-d(DPDPDDPDPD)-(5') (SEQ ID NO:25) 
where HEG=hexaethyleneglycol linker, 
A=2'-deoxyadenosine, T=thymidine, 
D=2-amino-2'-deoxyadenosine, and P=5-propynyl-2'-deoxyuridine. Each 
chip was then hybridized in a solution of a 
fluorescein-labeled oligodeoxynucleotide target, 
(5')-Fluorescein-d(TATATTATAT)-(HEG)-d(GCGCGGCGCG)-(3') (SEQ ID 
NO:26 and SEQ ID NO:27), which is complementary to 
both the A/T and G/C probes. The hybridization conditions 
were: 10 nM target in 5x SSPE buffer at 22° C. with gentle 
shaking. After 3 hours, the chip was mounted on the flowcell 
of a scanning laser confocal fluorescence microscope, rinsed 
briefly with 5x SSPE buffer at 22° C., and then a surface 
fluorescence image was obtained. Hybridization to the chip 
was continued overnight (total hybridization time=20 hr), 
and a surface fluorescence image was acquired again. 

The relative efficiency of hybridization of the target to the 
A/T and substituted A/T probes was determined by 
comparing the average surface fluorescence intensity bound to those 
parts of the chip containing the A/T or substituted probe to 
the fluorescence intensity bound to the G/C control probe 
sequence. The results (FIG. 5) show that 5-propynyl-dU and 
2-amino-dA substitution in an A/T-rich probe significantly 
enhances the affinity of an oligonucleotide analogue for 
complementary target sequences. The unsubstituted A/T 
probe bound only 20% as much target as the all-G/C probe 
of the same length, while the D- & P-substituted A/T probe 
bound nearly as much (90%) as the G/C probe. Moreover, 
the kinetics of hybridization are such that, at early times, the 
amount of target bound to the substituted A/T probes 
exceeds that which is bound to the all-G/C probe. 
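
The normalization against the G/C control probe described in this example can be written as a short Python sketch. The pixel intensities are hypothetical values chosen only to reproduce a 20% ratio; they are not data from the experiment. The optional background subtraction is likewise an assumption, since the text does not say whether one was applied.

```python
from statistics import mean


def relative_binding(test_pixels, control_pixels, background=0.0):
    """Background-corrected mean test signal as a fraction of the control."""
    return (mean(test_pixels) - background) / (mean(control_pixels) - background)


# Hypothetical pixel intensities (arbitrary units), not data from the source.
gc_control = [1000, 1020, 980]
at_probe = [210, 190, 200]
ratio = relative_binding(at_probe, gc_control)
```

Expressing each probe region as a fraction of the control region on the same chip makes results comparable between chips that differ in overall signal level.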

Example 8 

Hybridization to oligodeoxynucleotide probes substituted 
with 7-deaza-2'-deoxyguanosine (ddG) and 2'-deoxyinosine 
(dI) 

A 16x64 oligonucleotide array was constructed using 
VLSIPS™ methodology, with 5'-O-MeNPOC-protected 
nucleoside phosphoramidites, including the analogs ddG 
and dI. The array was comprised of the set of probes 
represented by the following sequence: -(linker)-(3')-d(A T 
G T T G1 G2 G3 G4 G5 C G G G T)-(5') (SEQ ID NO:28); 
where underlined bases are fixed, and the five internal 
deoxyguanosines (G1-5) are substituted with G, ddG, dI, and 
T in all possible (1024 total) combinations. A 
complementary oligonucleotide target, labeled with fluorescein at the 
5'-end: (5')-Fl-d(C A A T A C A A C C C C C G C C C A 
T C C)-(3') (SEQ ID NO:29), was hybridized to the array. 
The hybridization conditions were: 5 nM target in 6x SSPE 
buffer at 22° C. with shaking. After 30 minutes, the chip was 
mounted on the flowcell of an Affymetrix scanning laser 
confocal fluorescence microscope, rinsed once with 0.25x 
SSPE buffer at 22° C., and then a surface fluorescence image 
was acquired. 
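
The count of 1024 probes follows from four choices at each of five positions (4^5), which exactly fills a 16x64 array. The Python sketch below enumerates the combinations; the row-major assignment of combinations to array cells is a hypothetical choice, as the actual layout is not described here.

```python
from itertools import product

ANALOGS = ("G", "ddG", "dI", "T")  # choices at each of the five positions
ROWS, COLS = 16, 64                # array dimensions given in the text


def enumerate_probes():
    """Map each (row, col) cell to one 5-tuple of substitutions at G1..G5."""
    combos = list(product(ANALOGS, repeat=5))
    return {divmod(i, COLS): combo for i, combo in enumerate(combos)}


probes = enumerate_probes()
```

The cell holding five unsubstituted G residues serves as the natural reference against which the substituted probes can be compared.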

The "efficiency" of target hybridization to each probe in 
the array is proportional to the bound surface fluorescence 
intensity in the region of the chip where the probe was 
synthesized. The relative values for a subset of probes (those 
containing dG→ddG and dG→dI substitutions only) are 
shown in FIG. 6. Substitution of guanosine with 
7-deazaguanosine within the internal run of five G's results 
in a significant enhancement in the fluorescence signal 
intensity which measures hybridization. Deoxyinosine 
substitutions also enhance hybridization to the probe, but to a 
lesser extent. In this example, the best overall enhancement 
is realized when the dG "run" is ~40-60% substituted with 
7-deaza-dG, with the substitutions distributed evenly 
throughout the run (i.e., alternating dG/deaza-dG). 

Example 9 

Synthesis of 5'-MeNPOC-2'-deoxyinosine-3'-(N,N-diisopropyl-2-cyanoethyl)phosphoramidite 

2'-deoxyinosine (5.0 g, 20 mmole) was dissolved in 50 ml 
of dry DMF, and 100 ml dry pyridine was added and 
evaporated three times to dry the solution. Another 50 ml 
pyridine was added, the solution was cooled to -20° C. 
under argon, and 13.8 g (50 mmole) of MeNPOC-chloride 
in 20 ml dry DCM was then added dropwise with stirring 
over 60 minutes. After 60 minutes, the cold bath was 
removed, and the solution was allowed to stir overnight at 
room temperature. Pyridine and DCM were removed by 
evaporation, 500 ml of ethyl acetate was added, and the 
solution was washed twice with water and then with brine 
(200 ml each). The aqueous washes were combined and 
back-extracted twice with ethyl acetate, and then all of the 
organic layers were combined, dried with Na2SO4, and 
evaporated under vacuum. The product was recrystallized 
from DCM to obtain 5.0 g (50% yield) of pure 
5'-O-MeNPOC-2'-deoxyinosine as a yellow solid (99% purity, 
according to 1H-NMR and HPLC analysis). 
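
The quoted 50% yield can be checked with simple arithmetic. The Python sketch below does not rely on an independently calculated molecular weight; it assumes the value implied by the phosphitylation step of this example, which quotes 2.5 g of the same MeNPOC-nucleoside as 5.1 mmole. The function name is hypothetical.

```python
def percent_yield(product_g, product_mw, limiting_mmol):
    """Isolated yield as a percentage of the limiting reagent."""
    product_mmol = product_g / product_mw * 1000.0
    return 100.0 * product_mmol / limiting_mmol


# Molecular weight implied by "2.5 g, 5.1 mmole" for the same compound
# (about 490 g/mol); this is an inference, not a value stated in the text.
implied_mw = 2.5 / 5.1e-3

# 5.0 g of product isolated from 20 mmole of 2'-deoxyinosine.
step1 = percent_yield(5.0, implied_mw, 20.0)
```

Under that assumption the result is about 51%, consistent with the rounded 50% figure given in the text.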

The MeNPOC-nucleoside (2.5 g, 5.1 mmole) was 
suspended in 60 ml of dry CH3CN and phosphitylated with 
2-cyanoethyl-N,N,N',N'-tetraisopropylphosphorodiamidite 
(1.65 g/1.66 ml; 5.5 mmole) and 0.47 g (2.7 mmole) of 
diisopropylammonium tetrazolide, according to the 
published procedure of Barone, et al. (Nucleic Acids Res. (1984) 
12, 4051-61). The crude phosphoramidite was purified by 
flash chromatography on silica gel (90:8:2 DCM-MeOH-Et3N), 
co-evaporated twice with anhydrous acetonitrile and 
dried under vacuum for ~24 hours to obtain 2.8 g (80%) of 
the pure product as a yellow solid (98% purity as determined 
by 1H/31P-NMR and HPLC). 

Example 10 

Synthesis of 5'-MeNPOC-7-deaza-2'-deoxy(N2-isobutyryl)-guanosine-3'-(N,N-diisopropyl-2-cyanoethyl) 
phosphoramidite. 

The protected nucleoside 
7-deaza-2'-deoxy(N2-isobutyryl)guanosine (1.0 g, 3 mmole; Chemgenes Corp., 
Waltham, Mass.) was dried by co-evaporating three times 
with 5 ml anhydrous pyridine and dissolved in 5 ml of dry 
pyridine-DCM (75:25 by vol.). The solution was cooled to 
-45° C. (dry ice/CH3CN) under argon, and a solution of 0.9 
g (3.3 mmole) MeNPOC-Cl in 2 ml dry DCM was then 
added dropwise with stirring. After 30 minutes, the cold bath 
was removed, and the solution was allowed to stir overnight at 
room temperature. The solvents were evaporated, and the 
crude material was purified by flash chromatography on 
silica gel (2.5%-5% MeOH in DCM) to yield 1.5 g (88% 
yield) 5'-MeNPOC-7-deaza-2'-deoxy(N2-isobutyryl) 
guanosine as a yellow foam. The product was 98% pure 
according to 1H-NMR and HPLC analysis. 

The MeNPOC-nucleoside (1.25 g, 2.2 mmole) was 
phosphitylated according to the published procedure of Barone, 
et al. (Nucleic Acids Res. (1984) 12, 4051-61). The crude 
product was purified by flash chromatography on silica gel 
(60:35:5 hexane-ethyl acetate-Et3N), co-evaporated twice 
with anhydrous acetonitrile and dried under vacuum for ~24 
hours to obtain 1.3 g (75%) of the pure product as a yellow 
solid (98% purity as determined by 1H/31P-NMR and 
HPLC). 

Example 11 

Synthesis of 5'-MeNPOC-2,6-bis(phenoxyacetyl)-2,6-diaminopurine-2'-deoxyriboside-3'-(N,N-diisopropyl-2-cyanoethyl)phosphoramidite. 

The protected nucleoside 
2,6-bis(phenoxyacetyl)-2,6-diaminopurine-2'-deoxyriboside (8 mmole, 4.2 g) was dried 
by coevaporating twice from anhydrous pyridine, dissolved 
in 2:1 pyridine/DCM (17.6 ml) and then cooled to -40° C. 
MeNPOC-chloride (8 mmole, 2.18 g) was dissolved in 
DCM (6.6 ml) and added to the reaction mixture dropwise. The 
reaction was allowed to stir overnight with slow warming to 
room temperature. After the overnight stirring, another 2 
mmole (0.6 g) of MeNPOC-chloride in DCM (1.6 ml) was added to the reaction 
at -40° C. and stirred for an additional 6 hours or until no 
unreacted nucleoside was present. The reaction mixture was 
evaporated to dryness, and the residue was dissolved in ethyl 
acetate and washed with water twice, followed by a wash 
with saturated sodium chloride. The organic layer was dried 
with MgSO4, and evaporated to a yellow solid which was 
purified by flash chromatography in DCM employing a 
methanol gradient to elute the desired product in 51% yield. 

The 5'-MeNPOC-nucleoside (4.5 mmole, 3.5 g) was 
phosphitylated according to the published procedure of 
Barone, et al. (Nucleic Acids Res. (1984) 12, 4051-61). The 
crude product was purified by flash chromatography on 
silica gel (99:0.5:0.5 DCM-MeOH-Et3N). The pooled 
fractions were evaporated to an oil, redissolved in a minimum 
amount of DCM, precipitated by the addition of 800 ml ice 
cold hexane, filtered, and then dried under vacuum for ~24 
hours. 

Overall yield was 56%, at greater than 96% purity by 
HPLC and 1H/31P-NMR. 

Example 12 

5'-O-MeNPOC-protected phosphoramidites for 
incorporating 7-deaza-2'-deoxyguanosine and 2'-deoxyinosine into 
VLSIPS™ Oligonucleotide Arrays 

VLSIPS oligonucleotide probe arrays in which all or a 
subset of all guanosine residues are substituted with 
7-deaza-2'-deoxyguanosine and/or 2'-deoxyinosine are highly 
desirable. This is because guanine-rich regions of nucleic acids 
associate to form multi-stranded structures. For example, 
short tracts of G residues in RNA and DNA commonly 
associate to form tetrameric structures (Zimmermann et al. 
(1975) J. Mol. Biol. 92: 181; Kim, J. (1991) Nature 351: 
331; Sen et al. (1988) Nature 335: 364; and Sundquist et al. 
(1989) Nature 342: 825). The problem this poses to chip 
hybridization-based assays is that such structures may 
compete or interfere with normal hybridization between 
complementary nucleic acid sequences. However, by substituting 
the 7-deaza-G analog into G-rich nucleic acid sequences, 
particularly at one or more positions within a run of G 
residues, the tendency for such probes to form higher-order 
structures is suppressed, while maintaining essentially the 
same affinity and sequence specificity in double-stranded 
structures. This has been exploited in order to reduce band 
compression in sequencing gels (Mizusawa, et al. (1986) 
N.A.R. 14: 1319) and to improve target hybridization to G-rich 
probe sequences in VLSIPS arrays. Similar results are 
achieved using inosine (see also, Sanger et al. (1977) 
P.N.A.S. 74: 5463). 

For facile incorporation of 7-deaza-2'-deoxyguanosine 
and 2'-deoxyinosine into oligonucleotide arrays using 
VLSIPS™ methods, a nucleoside phosphoramidite 
comprising the analogue base which has a 5'-O-MeNPOC 
protecting group is constructed. This building block was 
prepared from commercially available nucleosides 
according to Scheme I. These amidites pass the usual tests for 
coupling efficiency and photolysis rate. 



SCHEME I 

[Reaction scheme not reproduced. Recoverable labels: 
MeNPOC-Cl/pyridine, CH2Cl2; NH-ibu; MeNPOC-O.] 

Although the foregoing invention has been described in 
some detail by way of illustration and example for purposes 
of clarity of understanding, modifications can be made 
thereto without departing from the spirit or scope of the 
appended claims. 



All publications and patent applications cited in this 
application are herein incorporated by reference for all 
purposes as if each individual publication or patent 
application were specifically and individually indicated to be 
incorporated by reference. 



SEQUENCE LISTING 



(1) GENERAL INFORMATION : 

(iii) NUMBER OF SEQUENCES: 29 



(2) INFORMATION FOR SEQ ID NO:1: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:1: 
AAGATGCTAC 



(2) INFORMATION FOR SEQ ID NO:2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2: 
AAAAANAAAA A 



(2) INFORMATION FOR SEQ ID NO:3: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:3: 
ATATAATATA 



(2) INFORMATION FOR SEQ ID NO:4: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4: 
CGCGCCGCGC 

(2) INFORMATION FOR SEQ ID NO: 5: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6..10 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = guanosine (G), 
2',3'-dideoxyguanine (ddG), 
2'-deoxyinosine (dI) or thymine (T)" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:5: 

TGGGCNNNNN TTGTA 



(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..20 

(D) OTHER INFORMATION: /note= "Target DNA sequence" 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:6: 
CTGAACGGTA GCATCTTGAC 



(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..20 

(D) OTHER INFORMATION: /note= "Complementary DNA sequence" 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7: 
GTCAAGATGC TACCGTTCAG 
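
The relationship between the target of SEQ ID NO:6 and the complementary sequence of SEQ ID NO:7 can be checked with a short sketch. The helper name `reverse_complement` is an assumption for this example and is not part of the listing; the check treats both sequences as plain DNA written 5' to 3'.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


target = "CTGAACGGTAGCATCTTGAC"         # SEQ ID NO:6
complementary = "GTCAAGATGCTACCGTTCAG"  # SEQ ID NO:7
print(reverse_complement(target) == complementary)  # True
```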



(2) INFORMATION FOR SEQ ID NO: 8: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: RNA 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..20 

(D) OTHER INFORMATION: /note= "Target RNA sequence" 
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:8: 
CUGAACGGUA GCAUCUUGAC 20 




(2) INFORMATION FOR SEQ ID NO: 9: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 2 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 4 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 7 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 9 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 11 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 12 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 13 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 14 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 15 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 16 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 17 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 18 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 19 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 20 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..20 

(D) OTHER INFORMATION: /note= "Target 2'-O-methyl 
oligonucleotide sequence" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:9: 

NNNNNNNNNN NNNNNNNNNN 



(2) INFORMATION FOR SEQ ID NO: 10: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 8 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..8 

(D) OTHER INFORMATION: /note= "Matching DNA oligonucleotide 
probe" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 10: 
CTTGCCAT 



(2) INFORMATION FOR SEQ ID NO: 11: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 8 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 2 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 4 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 7 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..8 

(D) OTHER INFORMATION: /note= "Matching 2'-O-methyl 
oligonucleotide analogue probe" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:11: 

NNNNNNNN 



(2) INFORMATION FOR SEQ ID NO: 12: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 8 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 




(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..8 

(D) OTHER INFORMATION: /note= "DNA oligonucleotide probe 
with 1 base mismatch" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:12: 

CTTGCTAT 
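
To make the "1 base mismatch" annotation concrete, the sketch below compares the matching probe of SEQ ID NO:10 with the mismatch probe of SEQ ID NO:12 and reports where they differ. The function name `mismatch_positions` is an assumption introduced for this illustration.

```python
def mismatch_positions(a, b):
    """Return 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


matching_probe = "CTTGCCAT"  # SEQ ID NO:10
mismatch_probe = "CTTGCTAT"  # SEQ ID NO:12
print(mismatch_positions(matching_probe, mismatch_probe))  # [6]
```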



(2) INFORMATION FOR SEQ ID NO: 13: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 8 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-O-methyl oligonucleotide" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 2 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 4 

(D) OTHER INFORMATION: /mod_base= gm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= cm 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 7 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "2'-O-methyladenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= um 

(ix) FEATURE: 

(A) NAME/KEY: - 

(B) LOCATION: 1..8 

(D) OTHER INFORMATION: /note= "2'-O-methyl oligonucleotide 
analogue probe with 1 base mismatch" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:13: 

NNNNNNNN 



(2) INFORMATION FOR SEQ ID NO: 14: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = cytosine covalently 
modified at the 3' phosphate group with 
a hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 14: 

AAGATGNTAN 



(2) INFORMATION FOR SEQ ID NO: 15: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 15: 

AAGATNCTAN 



(2) INFORMATION FOR SEQ ID NO: 16: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 16: 

AAGANGCTAN 10 



(2) INFORMATION FOR SEQ ID NO: 17: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:17: 

AAGNTGCTAN 10 



(2) INFORMATION FOR SEQ ID NO: 18: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 5' phosphate group with a 
fluorescein molecule" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:18: 

NTGAACGGTA GCATCTTGAC 20 

(2) INFORMATION FOR SEQ ID NO: 19: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 11 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = adenine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 19: 

AAAAANAAAA N 11 



(2) INFORMATION FOR SEQ ID NO: 20: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = thymine covalently modified 
at the 5' phosphate group with a 
fluorescein molecule" 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 20: 
NTTTTGTTTT T 11 



(2) INFORMATION FOR SEQ ID NO: 21: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 21: 

CGCGCCGCGN 10 



(2) INFORMATION FOR SEQ ID NO: 22: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside 
analogue decanucleotide probe" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2'-deoxyadenosine covalently 
modified at the 3' phosphate group with 
a hexaethyleneglycol (HEG) linker" 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 22: 



NTNTNNTNTN 10 



(2) INFORMATION FOR SEQ ID NO: 23: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside 
analogue decanucleotide probe" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 2 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 4 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 7 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = 2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 9 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2'-deoxyadenosine covalently 
modified at the 3' phosphate group with 
a hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 23: 



NNNNNNNNNN 10 



(2) INFORMATION FOR SEQ ID NO:24: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside 
analogue decanucleotide probe" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine 
covalently modified at the 3' 
phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24: 

NTNTNNTNTN 10 



(2) INFORMATION FOR SEQ ID NO: 25: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: other nucleic acid 

(A) DESCRIPTION: /desc = "2'-deoxynucleoside/nucleoside 
analogue decanucleotide probe" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 2 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 3 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 4 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 5 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 7 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 8 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 9 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 5-propynyl-2'-deoxyuridine" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = 2-amino-2'-deoxyadenosine 
covalently modified at the 3' 
phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:25: 

NNNNNNNNNN 10 



(2) INFORMATION FOR SEQ ID NO: 26: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: DNA 



(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = thymine covalently modified 
at the 5' hydroxyl group with a 
fluorescein molecule" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 10 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = thymine covalently modified 
at the 3' phosphate group with a 
hexaethyleneglycol (HEG) linker which is 
covalently bound to the 5' phosphate 
group of the 5' guanine (N in pos. 1) of 
SEQ ID NO: 27" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 26: 

NATATTATAN 10 



(2) INFORMATION FOR SEQ ID NO: 27: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 10 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = guanine covalently modified 
at the 5' phosphate group with a 
hexaethyleneglycol (HEG) linker which is 
covalently bound to the 3' phosphate 
group of the 3' thymine (N in pos. 10) 
of SEQ ID NO: 26" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:27: 

NCGCGGCGCG 10 



(2) INFORMATION FOR SEQ ID NO: 28: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 15 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 

(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 6..10 

(D) OTHER INFORMATION: /mod_base= OTHER 
/note= "N = guanine (G), 
2',3'-dideoxyguanine (ddG), 
2'-deoxyinosine (dI) or thymine (T)" 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 15 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 5' phosphate group with a 
hexaethyleneglycol (HEG) linker" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:28: 

TGGGCNNNNN TTGTN 15 



(2) INFORMATION FOR SEQ ID NO: 29: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 21 base pairs 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: DNA 

(ix) FEATURE: 

(A) NAME/KEY: modified_base 

(B) LOCATION: 1 

(D) OTHER INFORMATION: /mod_base= OTHER 

/note= "N = cytosine covalently modified 
at the 5' phosphate group with a 
fluorescein molecule" 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 29: 

NAATACAACC CCCGCCCATC C 21 



What is claimed is: 

1. A composition for analyzing interactions between oli- 
gonucleotide targets and oligonucleotide probes comprising 
an array of a plurality of oligonucleotide analogue probes 
having different sequences, wherein said oligonucleotide 
analogue probes are coupled to a solid substrate at known 
locations and wherein said plurality of oligonucleotide ana- 
logue probes are selected to bind to complementary oligo- 
nucleotide targets with a similar hybridization stability 
across the array. 

2. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide targets. 

3. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

4. The composition of claim 1, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

5. The composition of claim 2, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

6. The composition of claim 2, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

7. The composition of claims 1-5 or 6, wherein said solid 
substrate is selected from the group consisting of silica, 
polymeric materials, glass, beads, chips, and slides. 

8. The composition of claims 1-5 or 6, wherein said 
composition comprises an array of oligonucleotide analogue 
probes 5 to 20 nucleotides in length. 

9. The composition of claims 1-5 or 6, wherein said array 
of oligonucleotide analogue probes comprises a nucleoside 
analogue with the formula 




wherein: 

the nucleoside analogue is not a naturally occurring DNA 
or RNA nucleoside; 

R1 is selected from the group consisting of hydrogen, 
methyl, hydroxyl, alkoxy, alkylthio, halogen, cyano, 
and azido; 

R2 is selected from the group consisting of hydrogen, 
methyl, hydroxyl, alkoxy, alkylthio, halogen, cyano, 
and azido; 

Y is a heterocyclic moiety; 

and wherein said nucleoside analogue is incorporated into 
the oligonucleotide analogue by attachment to a 3' 
hydroxyl of the nucleoside analogue, to a 5' hydroxyl of 
the nucleoside analogue, or both the 3' nucleoside and 
the 5' hydroxyl of the nucleoside analogue. 

10. The composition of claims 1-5 or 6, wherein said 
array of oligonucleotide analogue probes comprises a nucleoside 
analogue with the formula 




wherein: 

the nucleoside analogue is not a naturally occurring DNA 
or RNA nucleoside; 

R1 is selected from the group consisting of hydrogen, 
hydroxyl, methyl, methoxy, ethoxy, propoxy, allyloxy, 
propargyloxy, Fluorine, Chlorine, and Bromine; 

R2 is selected from the group consisting of hydrogen, 
hydroxyl, methyl, methoxy, ethoxy, propoxy, allyloxy, 
propargyloxy, Fluorine, Chlorine, and Bromine; and 

Y is a base selected from the group consisting of 
purines, purine analogues, pyrimidines, pyrimidine 
analogues, 3-nitropyrrole and 5-nitroindole; 

and wherein said nucleoside analogue is incorporated into 
the oligonucleotide analogue by attachment to a 3' 
hydroxyl of the nucleoside analogue, to a 5' hydroxyl of 
the nucleoside analogue, or both the 3' nucleoside and 
the 5' hydroxyl of the nucleoside analogue. 

11. The composition of claims 1-5 or 6, wherein each 
probe of said plurality of oligonucleotide analogue probes 
has at least one oligonucleotide analogue, and wherein at 
least one of said oligonucleotide analogues comprises a 
peptide nucleic acid. 

12. The composition of claims 1-5 or 6, wherein at least 
one of said plurality of oligonucleotide analogue probes of said 
array of oligonucleotide analogue probes is resistant to 
RNAase A. 

13. The composition of claims 1-5 or 6, wherein said 
solid substrate is attached to over 1000 different oligonucle- 
otide analogue probes. 

14. The composition of claims 1-5 or 6, wherein each 
probe of said plurality of oligonucleotide analogue probes 
has at least one oligonucleotide analogue, and wherein at 
least one of said oligonucleotide analogues comprises 2'-O- 
methyl nucleotides. 

15. The composition of claims 1-5 or 6, wherein said 
array of oligonucleotide analogue probes and said solid 
substrate comprises a plurality of different oligonucleotide 
analogue probes, each oligonucleotide analogue probe hav- 
ing the formula: 

Y—L1—X1—L2—X2 

wherein, 

Y is a solid substrate; 

X1 and X2 are complementary oligonucleotides contain- 
ing at least one nucleotide analogue; 

L1 is a spacer; 

L2 is a linking group having sufficient length such that X1 
and X2 form a double-stranded oligonucleotide. 

16. The composition of claim 15, wherein said composi- 
tion comprises a library of unimolecular double-stranded 
oligonucleotide analogue probes. 

17. The composition of claims 1-5 or 6, wherein said 
array of oligonucleotide analogue probes comprises a con- 
formationally restricted array of oligonucleotide analogue 
probes with the formula: 

—X11—Z—X12 

wherein X11 and X12 are complementary oligonucleotides 
or oligonucleotide analogues and Z is a presented 
moiety. 

18. The composition of claims 1-5 or 6, wherein each 
probe of said plurality of oligonucleotide analogue probes 
has at least one oligonucleotide analogue, and wherein at 
least one of said oligonucleotide analogues comprises a 
nucleotide with a base selected from the group of bases 
consisting of 5-propynyluracil, 5-propynylcytosine, 
2-aminoadenine, 7-deazaguanine, 2-aminopurine, 8-aza-7- 
deazaguanine, 1H-purine, and hypoxanthine. 

19. The composition of claims 1-5 or 6, wherein said 
plurality of oligonucleotide analogue probes are coupled to 
said solid substrate by light-directed chemical coupling. 

20. The composition of claim 19, wherein said solid 
substrate is derivatized with a silane reagent prior to syn- 
thesis of said plurality of oligonucleotide analogue probes. 

21. The composition of claims 1-5 or 6, wherein said 
plurality of oligonucleotide analogue probes are coupled to 
said solid substrate by flowing oligonucleotide analogue 
reagents over known locations of the solid substrate. 

22. The composition of claim 21, wherein said solid 
substrate is derivatized with a silane reagent prior to syn- 
thesis of said plurality of oligonucleotide analogue probes. 

23. The composition of claims 1-5 or 6, wherein at least 
one of said plurality of oligonucleotide analogue probes 
forms a first duplex with a target oligonucleotide sequence, 
wherein said oligonucleotide analogue probe has a corre- 
sponding oligonucleotide sequence that forms a second 
duplex with said target oligonucleotide sequence, wherein 
said second duplex is rich in A-T or G-C nucleotide pairs, 
and wherein said oligonucleotide analogue probe has at least 
one nucleotide analogue in place of an A, T, G, or C 
nucleotide of said corresponding oligonucleotide sequence 
at a position within said oligonucleotide analogue probe 
such that said first duplex has an increased hybridization 
stability than said second duplex. 

24. The composition of claim 23, wherein said oligo- 
nucleotide analogue probe contains fewer bases than said 
corresponding oligonucleotide sequence. 

25. The composition of claims 1-5 or 6, wherein said 
oligonucleotide analogue probe forms a first duplex with a 
target oligonucleotide sequence, wherein said oligonucle- 
otide analogue probe has a corresponding oligonucleotide 
sequence that forms a second duplex with said target poly- 
nucleotide sequence, and wherein said oligonucleotide ana- 
logue probe is shorter than said corresponding polynucle- 
otide sequence. 

26. A composition for analyzing the interaction between 
an oligonucleotide target and an oligonucleotide probe com- 
prising an array of a plurality of oligonucleotide probes 
having different sequences hybridized to complementary 
oligonucleotide analogue targets, wherein said oligonucle- 
otide analogue targets bind to complementary oligonucle- 
otide probes with a similar hybridization stability across the 
array. 

27. The composition of claim 26, wherein at least one of 
said oligonucleotide analogue target is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide probes. 

28. The composition of claim 26, wherein at least one of 
said oligonucleotide analogue targets has increased the 
thermal stability between said oligonucleotide analogue 
target and said complementary oligonucleotide probe as 
compared to an oligonucleotide target that is the perfect 
complement to the complementary oligonucleotide probe 
with which said oligonucleotide analogue target anneals. 

29. The composition of claim 26, wherein at least one of 
said oligonucleotide analogue targets has decreased the 
thermal stability between said oligonucleotide analogue 
target and said complementary oligonucleotide probe as 
compared to an oligonucleotide target that is the perfect 
complement to the complementary oligonucleotide probe 
with which said oligonucleotide analogue target anneals. 

30. The composition of claim 27, wherein at least one of 
said oligonucleotide analogue targets has increased the 
thermal stability between said oligonucleotide analogue 
target and said complementary oligonucleotide probe as 
compared to an oligonucleotide target that is the perfect 
complement to the complementary oligonucleotide probe 
with which said oligonucleotide analogue target anneals. 

31. The composition of claim 27, wherein at least one of 
said oligonucleotide analogue targets has decreased the 
thermal stability between said oligonucleotide analogue 
target and said complementary oligonucleotide probe as 
compared to an oligonucleotide target that is the perfect 
complement to the complementary oligonucleotide probe 
with which said oligonucleotide analogue target anneals. 

32. The composition of claims 26-30 or 31, wherein the 
oligonucleotide analogue target is a PCR amplicon. 

33. The composition of claims 26-30 or 31, wherein at 
least one of said plurality of oligonucleotide probes com- 
prises at least one oligonucleotide analogue. 

34. The composition of claims 26-30 or 31, wherein at 
least one target oligonucleotide analogue acid is an RNA 
nucleic acid. 

35. A method of analyzing interactions between an oligo- 
nucleotide target and an oligonucleotide probe, comprising 
the steps of: 

(a) synthesizing an oligonucleotide analogue array com- 
prising a plurality of oligonucleotide analogue probes 
having different sequences, wherein said oligonucle- 
otide analogue probes are coupled to a solid substrate 
at known locations, said solid substrate having a sur- 
face; 

(b) exposing said oligonucleotide analogue probe array to 
a plurality of oligonucleotide targets under hybridiza- 
tion conditions such that said plurality of oligonucle- 
otide analogue probes bind to complementary oligo- 
nucleotide targets with a similar hybridization stability 
across the array; and 

(c) determining whether an oligonucleotide analogue 
probe of said oligonucleotide analogue probe array 
binds to at least one of said target nucleic acids. 

36. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes is selected 
to maintain hybridization specificity or mismatch discrimi- 
nation with said complementary oligonucleotide targets. 

37. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes has 
increased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

38. The method in accordance of claim 35, wherein at 
least one of said oligonucleotide analogue probes has 
decreased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

39. The method in accordance of claim 36, wherein at 
least one of said oligonucleotide analogue probes has 
increased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

40. The method in accordance of claim 36, wherein at 
least one of said oligonucleotide analogue probes has 
decreased the thermal stability between said oligonucleotide 
analogue probe and said complementary oligonucleotide 
target as compared to an oligonucleotide probe that is the 
perfect complement to the complementary oligonucleotide 
target with which said oligonucleotide analogue probe 
anneals. 

41. The method of claims 35-39 or 40, wherein said 
oligonucleotide target is selected from the group comprising 
genomic DNA, cDNA, unspliced RNA, mRNA, and rRNA. 

42. The method of claims 35-39 or 40, wherein said target 
nucleic acid is amplified prior to said hybridization step. 

43. The method of claims 35-39 or 40, wherein said 
plurality of oligonucleotide analogue probes is synthesized 
on said solid support by light-directed synthesis. 

44. The method of claims 35-39 or 40, wherein said 
plurality of said oligonucleotide analogue probes is synthe- 
sized on said solid support by causing oligonucleotide 
analogue synthetic reagents to flow over known locations of 
said solid support. 

45. The method of claims 35-39 or 40, wherein said step 
(a), comprises the steps of: 

i) forming a plurality of channels adjacent to the surface 
of said substrate; 

ii) placing selected reagents in said channels to synthe- 
size oligonucleotide analogue probes at known loca- 
tions; and 

iii) repeating steps i) and ii), thereby forming an array of 
oligonucleotide analogue probes having different 
sequences at known locations on said substrate. 

46. The method of claims 35-39 or 40, wherein said solid 
substrate is selected from the group consisting of beads, 
slides, and chips. 

47. The method of claims 35-39 or 40, wherein said solid 
substrate is comprised of materials selected from the group 
consisting of silica, polymers and glass. 

48. The method of claims 35-39 or 40, wherein the 
oligonucleotide analogue probes of said array are synthe- 
sized using photoremovable protecting groups. 

49. The method of claims 35-39 or 40, further comprising 
selectively incorporating MeNPoc onto the 3' or 5' hydroxyl 
of at least one nucleoside analogue and selectively incorpo- 
rating said nucleoside analogue into at least one of said 
oligonucleotide analogue probes. 

50. The method of claims 35-39 or 40, wherein at least 
one of said oligonucleotide analogue probes is synthesized 
from phosphoramidite nucleoside reagents. 

51. A method of detecting an oligonucleotide target, 
comprising enzymatically copying an oligonucleotide target 
using at least one nucleotide analogue, thereby producing 
multiple oligonucleotide analogue targets, selecting said 
oligonucleotide analogue targets such that said oligonucle- 
otide analogue targets bind to the complementary oligo- 
nucleotide probes coupled to a solid surface at known 
locations of an array with a similar hybridization stability 
across the array, hybridizing the oligonucleotide analogue 
targets to complementary oligonucleotide probes, and 
detecting whether at least one of said oligonucleotide ana- 
logue targets binds to said complementary oligonucleotide 
acid probe. 

52. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide probes. 

53. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets has increased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

54. The method of claim 51, wherein at least one of said 
oligonucleotide analogue targets has decreased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

55. The method of claim 52, wherein at least one of said 
oligonucleotide analogue targets has increased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 



56. The method of claim 52, wherein at least one of said 
oligonucleotide analogue targets has decreased the thermal 
stability between said oligonucleotide analogue target and 
said complementary oligonucleotide probe as compared to 
an oligonucleotide target that is the perfect complement to 
the complementary oligonucleotide probe with which said 
oligonucleotide analogue target anneals. 

57. The method of claims 51-55 or 56, wherein the 
oligonucleotide probe array comprises at least one oligo- 
nucleotide analogue probe which is complementary to at 
least one of said oligonucleotide analogue targets. 

58. A method of making an array of oligonucleotide 
probes, comprising providing a plurality of oligonucleotide 
analogue probes having at least one oligonucleotide 
analogue, said oligonucleotide analogue probes having dif- 
ferent sequences at known locations on an array, selecting 
the oligonucleotide analogue probes to hybridize with 
complementary oligonucleotide target sequences under 
hybridization conditions such that said oligonucleotide ana- 
logue probes bind to complementary oligonucleotide targets 
with a similar hybridization stability across the array. 

59. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes is selected to maintain 
hybridization specificity or mismatch discrimination with 
said complementary oligonucleotide targets. 

60. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

61. The method of claim 58, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

62. The method of claim 59, wherein at least one of said 
oligonucleotide analogue probes has increased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

63. The method of claim 59, wherein at least one of said 
oligonucleotide analogue probes has decreased the thermal 
stability between said oligonucleotide analogue probe and 
said complementary oligonucleotide target as compared to 
an oligonucleotide probe that is the perfect complement to 
the complementary oligonucleotide target with which said 
oligonucleotide analogue probe anneals. 

64. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle- 
otide analogue into at least one of the oligonucleotide 
analogue probes of the array to reduce or prevent the 
formation of secondary structure in the oligonucleotide of 
the array. 

65. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle- 
otide analogue into at least one of the oligonucleotide targets 
to reduce or prevent the formation of secondary structure in 
the target polynucleotide sequence. 

66. The method in accordance with claims 58-62, or 63, 
further comprising incorporating at least one oligonucle- 
otide analogue into at least one of the oligonucleotide 
analogue probes of the array to create secondary structure in 
the oligonucleotide of the array. 

67. The method in accordance with claims 58-62, or 63, 
further comprising incorporating a base selected from the 
group consisting of 5-propynyluracil, 5-propynylcytosine, 
2-aminoadenine, 7-deazaguanine, 2-aminopurine, 8-aza-7- 
deazaguanine, 1H-purine, and hypoxanthine into the oligo- 
nucleotide analogue probes of the array. 

68. The method of claim 67 further comprising selecting 
said at least one oligonucleotide analogue such that the 
oligonucleotide analogue probe is a homopolymer. 

69. The method in accordance with claims 58-62, or 63, 
further comprising selecting said at least one oligonucleotide 
analogue from the group consisting essentially of oligo- 
nucleotide analogues comprising 2'-O-methyl nucleotides 
and oligonucleotides comprising a base selected from the 
group of bases consisting of 5-propynyluracil, 
5-propynylcytosine, 7-deazaguanine, 2-aminoadenine, 
8-aza-7-deazaguanine, 1H-purine, and hypoxanthine. 

70. The method in accordance with claims 58-62 or 63, 
further comprising selecting said at least one oligonucleotide 
analogue such that oligonucleotide analogue probes com- 
prises at least one peptide nucleic acid. 

71. The method in accordance with claims 58-62, or 63, 
further comprising selecting said at least one oligonucleotide 
analogue to increase image brightness when the oligonucle- 
otide target and the oligonucleotide analogue probe hybrid- 
ize in the presence of a fluorescent indicator, in comparison 
to an oligonucleotide probe without oligonucleotide analogs. 

72. The method in accordance with claims 58-62, or 63, 
further comprising providing said plurality of oligonucle- 
otide analogue probes in an array with at least 1000 other 
oligonucleotide analogue probes. 

NUCLEIC ACID ARRAYS 

This application is a continuation of Ser. No. 09/129,470 
filed Aug. 4, 1998, which is a continuation of Ser. No. 
08/456,598 filed Jun. 1, 1995, which is a divisional of Ser. 
No. 07/954,646 filed Sep. 30, 1992, now issued as U.S. Pat. 
No. 5,445,934, which is a divisional of Ser. No. 07/850,356, 
filed Mar. 12, 1992, now issued as U.S. Pat. No. 5,405,783, 
which is a divisional of Ser. No. 07/492,462 filed Mar. 7, 
1990, now issued as U.S. Pat. No. 5,143,854, which is a 
continuation-in-part of Ser. No. 07/362,901 filed Jun. 7, 
1989, now abandoned, the disclosures of which are incor- 
porated by reference. 

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document 
contains material which is subject to copyright protection. 
The copyright owner has no objection to the facsimile 
reproduction by anyone of the patent document or the patent 
disclosure as it appears in the Patent and Trademark Office 
patent file or records, but otherwise reserves all copyright 
rights whatsoever. 

BACKGROUND OF THE INVENTION 

The present inventions relate to the synthesis and place- 
ment of materials at known locations. In particular, one 
embodiment of the inventions provides a method and asso- 
ciated apparatus for preparing diverse chemical sequences at 
known locations on a single substrate surface. The inven- 
tions may be applied, for example, in the field of oligomer, 
peptide, nucleic acid, oligosaccharide, phospholipid, 
polymer, or drug congener preparation, especially to create 
sources of chemical diversity for use in screening for 
biological activity. 

The relationship between structure and activity of mol- 
ecules is a fundamental issue in the study of biological 
systems. Structure-activity relationships are important in 
understanding, for example, the function of enzymes, the 
ways in which cells communicate with each other, as well as 
cellular control and feedback systems. 

Certain macromolecules are known to interact and bind to 
other molecules having a very specific three-dimensional 
spatial and electronic distribution. Any large molecule hav- 
ing such specificity can be considered a receptor, whether it 
is an enzyme catalyzing hydrolysis of a metabolic 
intermediate, a cell-surface protein mediating membrane 
transport of ions, a glycoprotein serving to identify a par- 
ticular cell to its neighbors, an IgG-class antibody circulat- 
ing in the plasma, an oligonucleotide sequence of DNA in 
the nucleus, or the like. The various molecules which 
receptors selectively bind are known as ligands. 

Many assays are available for measuring the binding 
affinity of known receptors and ligands, but the information 
which can be gained from such experiments is often limited 
by the number and type of ligands which are available. 
Novel ligands are sometimes discovered by chance or by 
application of new techniques for the elucidation of molecu- 
lar structure, including x-ray crystallographic analysis and 
recombinant genetic techniques for proteins. 

Small peptides are an exemplary system for exploring the 
relationship between structure and function in biology. A 
peptide is a sequence of amino acids. When the twenty 
naturally occurring amino acids are condensed into poly- 
meric molecules they form a wide variety of three- 
dimensional configurations, each resulting from a particular 
amino acid sequence and solvent condition. The number of 
possible pentapeptides of the 20 naturally occurring amino 
acids, for example, is 20^5 or 3.2 million different peptides. 
The likelihood that molecules of this size might be useful in 
receptor-binding studies is supported by epitope analysis 
studies showing that some antibodies recognize sequences 
as short as a few amino acids with high specificity. 
Furthermore, the average molecular weight of amino acids 
puts small peptides in the size range of many currently 
useful pharmaceutical products. 
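
The pentapeptide count quoted above can be checked with a short sketch. The names below (AMINO_ACIDS, count_peptides, enumerate_peptides) are illustrative assumptions and are not part of the original disclosure; the alphabet is simply the standard one-letter codes for the twenty naturally occurring amino acids.

```python
from itertools import product

# One-letter codes of the 20 naturally occurring amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_peptides(length, alphabet_size=len(AMINO_ACIDS)):
    """Number of distinct linear peptides of the given length."""
    return alphabet_size ** length


def enumerate_peptides(length, alphabet=AMINO_ACIDS):
    """Lazily yield every peptide of the given length as a string."""
    for residues in product(alphabet, repeat=length):
        yield "".join(residues)


print(count_peptides(5))  # 3200000
```

Counting is cheap, while enumerating all 3.2 million sequences is only practical lazily, which is why the enumeration is written as a generator.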

Pharmaceutical drug discovery is one type of research 
which relies on such a study of structure-activity relation- 
ships. In most cases, contemporary pharmaceutical research 
can be described as the process of discovering novel ligands 
with desirable patterns of specificity for biologically impor- 
tant receptors. Another example is research to discover new 
compounds for use in agriculture, such as pesticides and 
herbicides. 

Sometimes, the solution to a rational process of designing 
ligands is difficult or unyielding. Prior methods of preparing 
large numbers of different polymers have been painstakingly 
slow when used at a scale sufficient to permit effective 
rational or random screening. For example, the "Merrifield" 
method (J. Am. Chem. Soc. (1963) 85:2149-2154, which is 
incorporated herein by reference for all purposes) has been 
used to synthesize peptides on a solid support. In the 
Merrifield method, an amino acid is covalently bonded to a 
support made of an insoluble polymer. Another amino acid 
with an alpha protected group is reacted with the covalently 
bonded amino acid to form a dipeptide. After washing, the 
protective group is removed and a third amino acid with an 
alpha protective group is added to the dipeptide. This 
process is continued until a peptide of a desired length and 
sequence is obtained. Using the Merrifield method, it is not 
economically practical to synthesize more than a handful of 
peptide sequences in a day. 

To synthesize larger numbers of polymer sequences, it has 
also been proposed to use a series of reaction vessels for 
polymer synthesis. For example, a tubular reactor system 
may be used to synthesize a linear polymer on a solid phase 
support by automated sequential addition of reagents. This 
method still does not enable the synthesis of a sufficiently 
large number of polymer sequences for effective economical 
screening. 

Methods of preparing a plurality of polymer sequences 
are also known in which a foraminous container encloses a 
known quantity of reactive particles, the particles being 
larger in size than foramina of the container. The containers 
may be selectively reacted with desired materials to synthe- 
size desired sequences of product molecules. As with other 
methods known in the art, this method cannot practically be 
used to synthesize a sufficient variety of polypeptides for 
effective screening. 

Other techniques have also been described. These meth- 
ods include the synthesis of peptides on 96 plastic pins 
which fit the format of standard microtiter plates. 
Unfortunately, while these techniques have been somewhat 
useful, substantial problems remain. For example, these 
methods continue to be limited in the diversity of sequences 
which can be economically synthesized and screened. 

From the above, it is seen that an improved method and 
apparatus for synthesizing a variety of chemical sequences 
at known locations is desired. 

SUMMARY OF THE INVENTION 

An improved method and apparatus for the preparation of 
a variety of polymers is disclosed. 

In one preferred embodiment, linker molecules are pro- 
vided on a substrate. A terminal end of the linker molecules 
is provided with a reactive functional group protected with 
a photoremovable protective group. Using lithographic 
methods, the photoremovable protective group is exposed to 
light and removed from the linker molecules in first selected 
regions. The substrate is then washed or otherwise contacted 
with a first monomer that reacts with exposed functional 
groups on the linker molecules. In a preferred embodiment, 
the monomer is an amino acid containing a photoremovable 
protective group at its amino or carboxy terminus and the 
linker molecule terminates in an amino or carboxy acid 
group bearing a photoremovable protective group. 

A second set of selected regions is, thereafter, exposed to 
light and the photoremovable protective group on the linker 
molecule/protected amino acid is removed at the second set 
of regions. The substrate is then contacted with a second 
monomer containing a photoremovable protective group for 
reaction with exposed functional groups. This process is 
repeated to selectively apply monomers until polymers of a 
desired length and desired chemical sequence are obtained. 
Photolabile groups are then optionally removed and the 
sequence is, thereafter, optionally capped. Side chain pro- 
tective groups, if present, are also removed. 

By using the lithographic techniques disclosed herein, it 
is possible to direct light to relatively small and precisely 
known locations on the substrate. It is, therefore, possible to 
synthesize polymers of a known chemical sequence at 
known locations on the substrate. 
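
The cycle of selective deprotection followed by monomer coupling described above can be pictured with a minimal simulation. This is only a sketch under simplifying assumptions: the substrate is modeled as a list of numbered sites, a mask is the set of site indices exposed to light, and coupling is assumed to be complete at every exposed site. The function name synthesize and the example masks are hypothetical.

```python
def synthesize(num_sites, steps):
    """Simulate masked, stepwise synthesis on a substrate.

    steps is a list of (mask, monomer) pairs, where mask is the set of
    site indices exposed to light in that cycle.
    """
    chains = [""] * num_sites          # growing polymer at each site
    for mask, monomer in steps:
        for site in mask:              # light removes the protective group here
            chains[site] += monomer    # only deprotected sites couple the monomer
        # Each added monomer carries its own protective group, so sites
        # outside the mask stay unchanged until a later cycle exposes them.
    return chains


# Four cycles with two monomers yield all four dimers at known sites.
steps = [({0, 1}, "A"), ({2, 3}, "B"), ({0, 2}, "A"), ({1, 3}, "B")]
print(synthesize(4, steps))  # ['AA', 'AB', 'BA', 'BB']
```

Because the mask used in each cycle is known, the sequence at every site follows directly from the list of steps.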

The resulting substrate will have a variety of uses 
including, for example, screening large numbers of poly- 
mers for biological activity. To screen for biological activity, 
the substrate is exposed to one or more receptors such as 
antibodies, whole cells, receptors on vesicles, lipids, or any 
one of a variety of other receptors. The receptors are 
preferably labeled with, for example, a fluorescent marker, 
radioactive marker, or a labeled antibody reactive with the 
receptor. The location of the marker on the substrate is 
detected with, for example, photon detection or autoradio- 
graphic techniques. Through knowledge of the sequence of 
the material at the location where binding is detected, it is 
possible to quickly determine which sequence binds with the 
receptor and, therefore, the technique can be used to screen 
large numbers of peptides. Other possible applications of the 
inventions herein include diagnostics in which various anti- 
bodies for particular receptors would be placed on a sub- 
strate and, for example, blood sera would be screened for 
immune deficiencies. Still further applications include, for 
example, selective "doping" of organic materials in semi- 
conductor devices, and the like. 
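
The screening step described above, in which the known sequence at each location is matched to the locations where a marker is detected, might be sketched as follows. The function name binding_sequences, the layout, the intensity values, and the threshold are all hypothetical and chosen only for illustration; they do not represent measured results.

```python
def binding_sequences(layout, signal, threshold):
    """List sites whose signal exceeds threshold, brightest first.

    layout maps a site location to the sequence synthesized there;
    signal maps a site location to the detected marker intensity.
    """
    hits = [
        (location, layout[location], intensity)
        for location, intensity in signal.items()
        if location in layout and intensity > threshold
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical 2x2 layout with made-up intensities (illustration only).
layout = {(0, 0): "AA", (0, 1): "AB", (1, 0): "BA", (1, 1): "BB"}
signal = {(0, 0): 950, (0, 1): 40, (1, 0): 35, (1, 1): 120}
print(binding_sequences(layout, signal, threshold=100))
# [((0, 0), 'AA', 950), ((1, 1), 'BB', 120)]
```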

In connection with one aspect of the invention an 
improved reactor system for synthesizing polymers is also 
disclosed. The reactor system includes a substrate mount 
which engages a substrate around a periphery thereof. The 
substrate mount provides for a reactor space between the 
substrate and the mount through or into which reaction fluids 
are pumped or flowed. A mask is placed on or focused on the 
substrate and illuminated so as to deprotect selected regions 
of the substrate in the reactor space. A monomer is pumped 
through the reactor space or otherwise contacted with the 
substrate and reacts with the deprotected regions. By selec- 
tively deprotecting regions on the substrate and flowing 
predetermined monomers through the reactor space, desired 
polymers at known locations may be synthesized. 

Improved detection apparatus and methods are also dis- 
closed. The detection method and apparatus utilize a sub- 
strate having a large variety of polymer sequences at known 
locations on a surface thereof. The substrate is exposed to a 
fluorescently labeled receptor which binds to one or more of 
the polymer sequences. The substrate is placed in a micro- 
scope detection apparatus for identification of locations 
where binding takes place. The microscope detection appa- 
ratus includes a monochromatic or polychromatic light 
source for directing light at the substrate, means for detect- 
ing fluoresced light from the substrate, and means for 
determining a location of the fluoresced light. The means for 
detecting light fluoresced on the substrate may in some 
embodiments include a photon counter. The means for 
determining a location of the fluoresced light may include an 
x/y translation table for the substrate. Translation of the slide 
and data collection are recorded and managed by an appro- 
priately programmed digital computer. 
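
One way to picture the computer-managed translation and data collection is a raster scan that steps the x/y table over a grid and records the photon count at each position. The sketch below is an assumption-laden illustration, not the disclosed control program: read_counts stands in for whatever photon counter interface is actually used, and fake_counter is a made-up stand-in that returns a bright spot at one position.

```python
def raster_scan(read_counts, x_steps, y_steps):
    """Step over an x/y grid and record the counts read at each position."""
    image = []
    for y in range(y_steps):
        image.append([read_counts(x, y) for x in range(x_steps)])
    return image


def brightest_location(image):
    """Return ((x, y), counts) for the position with the highest counts."""
    best = max(
        ((x, y) for y, row in enumerate(image) for x in range(len(row))),
        key=lambda xy: image[xy[1]][xy[0]],
    )
    return best, image[best[1]][best[0]]


def fake_counter(x, y):
    """Hypothetical photon counter: one bright site on a dim background."""
    return 500 if (x, y) == (2, 1) else 10


image = raster_scan(fake_counter, x_steps=4, y_steps=3)
print(brightest_location(image))  # ((2, 1), 500)
```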

A further understanding of the nature and advantages of 
the inventions herein may be realized by reference to the 
remaining portions of the specification and the attached 
drawings. 

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 illustrates masking and irradiation of a substrate at 
a first location. The substrate is shown in cross-section; 

FIG. 2 illustrates the substrate after application of a 
monomer "A"; 

FIG. 3 illustrates irradiation of the substrate at a second 
location; 

FIG. 4 illustrates the substrate after application of mono- 
mer "B"; 

FIG. 5 illustrates irradiation of the "A" monomer; 

FIG. 6 illustrates the substrate after a second application 
of "B"; 

FIG. 7 illustrates a completed substrate; 

FIGS. 8A and 8B illustrate alternative embodiments of a 
reactor system for forming a plurality of polymers on a 
substrate; 

FIG. 9 illustrates a detection apparatus for locating fluo- 
rescent markers on the substrate; 

FIGS. 10A-10M illustrate the method as it is applied to 
the production of the trimers of monomers "A" and "B"; 

FIGS. 11A and 11B are fluorescence traces for standard 
fluorescent beads; 

FIGS. 12A and 12B are fluorescence curves for NVOC 
slides not exposed and exposed to light respectively; 

FIGS. 13A to 13D are fluorescence plots of slides exposed 
through 100 μm, 50 μm, 20 μm, and 10 μm masks; 

FIGS. 14A and 14B illustrate fluorescence of a slide with 
the peptide YGGFL on selected regions of its surface which 
has been exposed to labeled Herz antibody specific for this 
sequence; 

FIGS. 15A and 15B illustrate formation of and a fluores- 
cence plot of a slide with a checkerboard pattern of YGGFL 
and GGFL exposed to labeled Herz antibody. FIG. 15A 
illustrates a 500×500 μm mask which has been focused on 
the substrate according to FIG. 8A while FIG. 15B illustrates 
a 50×50 μm mask placed in direct contact with the substrate 
in accord with FIG. 8B; 

FIG. 16 is a fluorescence plot of YGGFL and PGGFL 
synthesized in a 50 μm checkerboard pattern; 

FIG. 17 is a fluorescence plot of YPGGFL and YGGFL 
synthesized in a 50 μm checkerboard pattern; 

FIGS. 18A and 18B illustrate the mapping of sixteen 
sequences synthesized on two different glass slides; 

FIG. 19 is a fluorescence plot of the slide illustrated in 
FIG. 18A; and 

FIG. 20 is a fluorescence plot of the slide illustrated in 
FIG. 18B. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

CONTENTS 

I. Glossary 

II. General 

III. Polymer Synthesis 

IV. Details of One Embodiment of a Reactor System 

V. Details of One Embodiment of a Fluorescent Detection 
Device 

VI. Determination of Relative Binding Strength of Recep- 
tors 

VII. Examples 

A. Slide Preparation 

B. Synthesis of Eight Trimers of "A" and "B" 

C. Synthesis of a Dimer of an Aminopropyl Group and 
a Fluorescent Group 

D. Demonstration of Signal Capability 

E. Determination of the Number of Molecules Per Unit 
Area 

F. Removal of NVOC and Attachment of a Fluorescent 
Marker 

G. Use of a Mask in Removal of NVOC 

H. Attachment of YGGFL and Subsequent Exposure to 
Herz Antibody and Goat Antimouse 

I. Monomer-by-Monomer Formation of YGGFL and 
Subsequent Exposure to Labeled Antibody 

J. Monomer-by-Monomer Synthesis of YGGFL and 
PGGFL 

K. Monomer-by-Monomer Synthesis of YGGFL and 
YPGGFL 

L. Synthesis of an Array of Sixteen Different Amino 
Acid Sequences and Estimation of Relative Binding 
Affinity to Herz Antibody 

VIII. Illustrative Alternative Embodiment 

IX. Conclusion 

I. Glossary 

The following terms are intended to have the following 
general meanings as they are used herein: 

1. Complementary: Refers to the topological compatibility 
or matching together of interacting surfaces of a ligand 
molecule and its receptor. Thus, the receptor and its ligand 
can be described as complementary, and furthermore, the 
contact surface characteristics are complementary to each 
other. 

2. Epitope: The portion of an antigen molecule which is 
delineated by the area of interaction with the subclass of 
receptors known as antibodies. 

3. Ligand: A ligand is a molecule that is recognized by a 
particular receptor. Examples of ligands that can be inves- 
tigated by this invention include, but are not restricted to, 
agonists and antagonists for cell membrane receptors, 
toxins and venoms, viral epitopes, hormones (e.g., 
opiates, steroids, etc.), hormone receptors, peptides, 
enzymes, enzyme substrates, cofactors, drugs, lectins, 
sugars, oligonucleotides, nucleic acids, oligosaccharides, 
proteins, and monoclonal antibodies. 

4. Monomer: A member of the set of small molecules which 
can be joined together to form a polymer. The set of 
monomers includes but is not restricted to, for example, 
the set of common L-amino acids, the set of D-amino 
acids, the set of synthetic amino acids, the set of nucle- 
otides and the set of pentoses and hexoses. As used herein, 
monomers refers to any member of a basis set for syn- 
thesis of a polymer. For example, dimers of L-amino acids 
form a basis set of 400 monomers for synthesis of 
polypeptides. Different basis sets of monomers may be 
used at successive steps in the synthesis of a polymer. 

5. Peptide: A polymer in which the monomers are alpha 
amino acids and which are joined together through amide 
bonds and alternatively referred to as a polypeptide. In the 
context of this specification it should be appreciated that 
the amino acids may be the L-optical isomer or the 
D-optical isomer. Peptides are more than two amino acid 
monomers long, and often more than 20 amino acid 
monomers long. Standard abbreviations for amino acids 
are used (e.g., P for proline). These abbreviations are 
included in Stryer, Biochemistry, Third Ed., 1988, which is
incorporated herein by reference for all purposes. 

6. Radiation: Energy which may be selectively applied 
including energy having a wavelength of between 10^-14
and 10^4 meters including, for example, electron beam
radiation, gamma radiation, x-ray radiation, ultra-violet 
radiation, visible light, infrared radiation, microwave 
radiation, and radio waves. "Irradiation" refers to the 
application of radiation to a surface. 

7. Receptor: A molecule that has an affinity for a given 
ligand. Receptors may be naturally occurring or manmade
molecules. Also, they can be employed in their unaltered 
state or as aggregates with other species. Receptors may 
be attached, covalently or noncovalently, to a binding 
member, either directly or via a specific binding sub- 
stance. Examples of receptors which can be employed by 
this invention include, but are not restricted to, antibodies, 
cell membrane receptors, monoclonal antibodies and anti- 
sera reactive with specific antigenic determinants (such as 
on viruses, cells or other materials), drugs, 
polynucleotides, nucleic acids, peptides, cofactors, 
lectins, sugars, polysaccharides, cells, cellular 
membranes, and organelles. Receptors are sometimes 
referred to in the art as anti-ligands. As the term receptors 
is used herein, no difference in meaning is intended. A 
"Ligand Receptor Pair" is formed when two macromol- 
ecules have combined through molecular recognition to 
form a complex. 

Other examples of receptors which can be investigated by 
this invention include but are not restricted to: 

a) Microorganism receptors: Determination of ligands 
which bind to receptors, such as specific transport 
proteins or enzymes essential to survival of 
microorganisms, is useful in developing a new class of antibiotics.
Of particular value would be antibiotics against oppor- 
tunistic fungi, protozoa, and those bacteria resistant to 
the antibiotics in current use. 

b) Enzymes: For instance, the binding site of enzymes 
such as the enzymes responsible for cleaving neu- 
rotransmitters; determination of ligands which bind to 
certain receptors to modulate the action of the enzymes 
which cleave the different neurotransmitters is useful in 
the development of drugs which can be used in the 
treatment of disorders of neurotransmission. 

c) Antibodies: For instance, the invention may be useful
in investigating the ligand-binding site on the antibody
molecule which combines with the epitope of an antigen
of interest; determining a sequence that mimics an
antigenic epitope may lead to the development of
vaccines of which the immunogen is based on one or
more of such sequences or lead to the development of
related diagnostic agents or compounds useful in
therapeutic treatments such as for auto-immune diseases
(e.g., by blocking the binding of the "self" antibodies).

d) Nucleic Acids: Sequences of nucleic acids may be
synthesized to establish DNA or RNA binding
sequences.

e) Catalytic Polypeptides: Polymers, preferably 
polypeptides, which are capable of promoting a chemi- 
cal reaction involving the conversion of one or more 
reactants to one or more products. Such polypeptides 
generally include a binding site specific for at least one 
reactant or reaction intermediate and an active
functionality proximate to the binding site, which
functionality is capable of chemically modifying the bound
reactant. Catalytic polypeptides are described in, for 
example, U.S. application Ser. No. 404,920, which is 
incorporated herein by reference for all purposes. 

f) Hormone receptors: For instance, the receptors for
insulin and growth hormone. Determination of the 
ligands which bind with high affinity to a receptor is 
useful in the development of, for example, an oral 
replacement of the daily injections which diabetics 
must take to relieve the symptoms of diabetes, and in
the other case, a replacement for the scarce human 
growth hormone which can only be obtained from 
cadavers or by recombinant DNA technology. Other 
examples are the vasoconstrictive hormone receptors; 
determination of those ligands which bind to a receptor
may lead to the development of drugs to control blood 
pressure. 

g) Opiate receptors: Determination of ligands which bind 
to the opiate receptors in the brain is useful in the 
development of less-addictive replacements for
morphine and related drugs.

8. Substrate: A material having a rigid or semi-rigid surface. 
In many embodiments, at least one surface of the substrate 
will be substantially flat, although in some embodiments
it may be desirable to physically separate synthesis
regions for different polymers with, for example, wells, 
raised regions, etched trenches, or the like. According to 
other embodiments, small beads may be provided on the 
surface which may be released upon completion of the 
synthesis.

9. Protective Group: A material which is bound to a mono- 
mer unit and which may be spatially removed upon 
selective exposure to an activator such as electromagnetic 
radiation. Examples of protective groups with utility 
herein include Nitroveratryloxy carbonyl, Nitrobenzyloxy
carbonyl, Dimethyl dimethoxybenzyloxy carbonyl,
5-Bromo-7-nitroindolinyl, o-Hydroxy-α-methyl
cinnamoyl, and 2-oxymethylene anthraquinone. Other
examples of activators include ion beams, electric fields,
magnetic fields, electron beams, x-ray, and the like.

10. Predefined Region: A predefined region is a localized 
area on a surface which is, was, or is intended to be 
activated for formation of a polymer. The predefined 
region may have any convenient shape, e.g., circular, 
rectangular, elliptical, wedge-shaped, etc. For the sake of
brevity herein, "predefined regions" are sometimes 
referred to simply as "regions." 

11. Substantially Pure: A polymer is considered to be "sub- 
stantially pure" within a predefined region of a substrate 
when it exhibits characteristics that distinguish it from
other predefined regions. Typically, purity will be
measured in terms of biological activity or function as a
result of uniform sequence. Such characteristics will typically
be measured by way of binding with a selected ligand or 
receptor. 
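
Merely by way of illustration, the basis-set arithmetic given in the
definition of "Monomer" above (dimers of L-amino acids forming a basis
set of 400 monomers) can be checked with a short sketch. The one-letter
codes are the standard amino acid abbreviations; the function name
basis_set is hypothetical and is introduced here only for the example.

```python
from itertools import product

# One-letter codes for the 20 common L-amino acids (standard abbreviations).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def basis_set(block_length):
    """Return every ordered block of `block_length` amino acids.

    Each block is treated as a single coupling unit, i.e. one "monomer"
    of the basis set in the sense of the glossary definition.
    """
    return ["".join(block) for block in product(AMINO_ACIDS, repeat=block_length)]


dimers = basis_set(2)
print(len(dimers))  # 400
```

The same function with a block length of 1 returns the 20 single amino
acids, so different basis sets could be drawn on at successive steps.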

II. General 

The present invention provides methods and apparatus for 
the preparation and use of a substrate having a plurality of
polymer sequences in predefined regions. The invention is
described herein primarily with regard to the preparation of 
molecules containing sequences of amino acids, but could 
readily be applied in the preparation of other polymers. Such 
polymers include, for example, both linear and cyclic
polymers of nucleic acids, polysaccharides, phospholipids, and
peptides having either α-, β-, or ω-amino acids,
heteropolymers in which a known drug is covalently bound to any
of the above, polyurethanes, polyesters, polycarbonates,
polyureas, polyamides, polyethyleneimines, polyarylene
sulfides, polysiloxanes, polyimides, polyacetates, or other
polymers which will be apparent upon review of this dis- 
closure. In a preferred embodiment, the invention herein is 
used in the synthesis of peptides. 

The prepared substrate may, for example, be used in 
screening a variety of polymers as ligands for binding with 
a receptor, although it will be apparent that the invention 
could be used for the synthesis of a receptor for binding with 
a ligand. The substrate disclosed herein will have a wide 
variety of other uses. Merely by way of example, the 
invention herein can be used in determining peptide and 
nucleic acid sequences which bind to proteins, finding 
sequence-specific binding drugs, identifying epitopes rec- 
ognized by antibodies, and evaluation of a variety of drugs 
for clinical and diagnostic applications, as well as combi- 
nations of the above. 

The invention preferably provides for the use of a sub- 
strate "S" with a surface. Linker molecules "L" are option- 
ally provided on a surface of the substrate. The purpose of 
the linker molecules, in some embodiments, is to facilitate 
receptor recognition of the synthesized polymers. 

Optionally, the linker molecules may be chemically pro- 
tected for storage purposes. A chemical storage protective 
group such as t-BOC (t-butoxycarbonyl) may be used in 
some embodiments. Such chemical protective groups would 
be chemically removed upon exposure to, for example, 
acidic solution and would serve to protect the surface during 
storage and be removed prior to polymer preparation. 

On the substrate or a distal end of the linker molecules, a 
functional group with a protective group P0 is provided. The
protective group P0 may be removed upon exposure to
radiation, electric fields, electric currents, or other activators 
to expose the functional group. 

In a preferred embodiment, the radiation is ultraviolet 
(UV), infrared (IR), or visible light. As more fully described 
below, the protective group may alternatively be an 
electrochemically-sensitive group which may be removed in 
the presence of an electric field. In still further alternative 
embodiments, ion beams, electron beams, or the like may be 
used for deprotection. 

In some embodiments, the exposed regions and, therefore, 
the area upon which each distinct polymer sequence is 
synthesized are smaller than about 1 cm^2 or less than 1 mm^2.
In preferred embodiments the exposed area is less than about
10,000 μm^2 or, more preferably, less than 100 μm^2 and may,
in some embodiments, encompass the binding site for as few 
as a single molecule. Within these regions, each polymer is 
preferably synthesized in a substantially pure form. 

Concurrently or after exposure of a known region of the
substrate to light, the surface is contacted with a first
monomer unit M1 which reacts with the functional group
which has been exposed by the deprotection step. The first
monomer includes a protective group P1. P1 may or may not
be the same as P0.

Accordingly, after a first cycle, known first regions of the
surface may comprise the sequence:

S-L-M1-P1

while remaining regions of the surface comprise the
sequence:

S-L-P0.

Thereafter, second regions of the surface (which may
include the first region) are exposed to light and contacted
with a second monomer M2 (which may or may not be the
same as M1) having a protective group P2. P2 may or may
not be the same as P0 and P1. After this second cycle,
different regions of the substrate may comprise one or more
of the following sequences:

S-L-M1-M2-P2

S-L-M2-P2

S-L-M1-P1 and/or

S-L-P0

The above process is repeated until the substrate includes
desired polymers of desired lengths. By controlling the
locations of the substrate exposed to light and the reagents 
exposed to the substrate following exposure, the location of 
each sequence will be known. 
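
Merely by way of illustration, the two cycles described above can be
traced with a minimal sketch. It assumes four hypothetical regions
(r1 through r4), each modeled as a list of units ending in its
protective group. The function names run_cycle and as_text are
introduced here only for the example and form no part of the disclosed
apparatus.

```python
def run_cycle(surface, exposed, monomer, group):
    """Apply one deprotection and coupling cycle.

    Light removes the terminal protective group only in the `exposed`
    regions; the protected monomer then couples only where a functional
    group was uncovered. All other regions are left unchanged.
    """
    for region in exposed:
        chain = surface[region]
        chain.pop()                      # deprotection: remove terminal group
        chain.extend([monomer, group])   # coupling: add monomer with its group
    return surface


def as_text(chain):
    """Render a chain such as ['S', 'L', 'P0'] as 'S-L-P0'."""
    return "-".join(chain)


# Every region starts as substrate, linker, and protective group P0.
surface = {name: ["S", "L", "P0"] for name in ("r1", "r2", "r3", "r4")}

run_cycle(surface, exposed=["r1", "r3"], monomer="M1", group="P1")  # first cycle
run_cycle(surface, exposed=["r1", "r2"], monomer="M2", group="P2")  # second cycle

for name in sorted(surface):
    print(name, as_text(surface[name]))
# r1 S-L-M1-M2-P2
# r2 S-L-M2-P2
# r3 S-L-M1-P1
# r4 S-L-P0
```

Because the exposed regions and the reagent are recorded for each
cycle, the sequence at every region is known without any analysis of
the surface itself.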

Thereafter, the protective groups are removed from some
or all of the substrate and the sequences are, optionally,
capped with a capping unit C. The process results in a
substrate having a surface with a plurality of polymers of the
following general formula:

S-[L]-(Mi)-(Mj)-(Mk) . . . (Mx)-[C]

where square brackets indicate optional groups, and Mi . . . Mx
indicates any sequence of monomers. The number of
monomers could cover a wide variety of values, but in a
preferred embodiment they will range from 2 to 100.

In some embodiments a plurality of locations on the
substrate polymers are to contain a common monomer
subsequence. For example, it may be desired to synthesize
a sequence S-M1-M2-M3 at first locations and a sequence
S-M4-M2-M3 at second locations. The process would
commence with irradiation of the first locations followed by
contacting with M1-P, resulting in the sequence S-M1-P at
the first location. The second locations would then be
irradiated and contacted with M4-P, resulting in the sequence
S-M4-P at the second locations. Thereafter both the first and
second locations would be irradiated and contacted with the
dimer M2-M3, resulting in the sequence S-M1-M2-M3 at the
first locations and S-M4-M2-M3 at the second locations. Of
course, common subsequences of any length could be
utilized including those in a range of 2 or more monomers, 2
to 100 monomers, 2 to 20 monomers, and a most preferred
range of 2 to 3 monomers.

According to other embodiments, a set of masks is used
for the first monomer layer and, thereafter, varied light
wavelengths are used for selective deprotection. For
example, in the process discussed above, first regions are
first exposed through a mask and reacted with a first
monomer having a first protective group P1, which is removable
upon exposure to a first wavelength of light (e.g., IR).
Second regions are masked and reacted with a second
monomer having a second protective group P2, which is
removable upon exposure to a second wavelength of light
(e.g., UV). Thereafter, masks become unnecessary in the
synthesis because the entire substrate may be exposed
alternately to the first and second wavelengths of light in
the deprotection cycle. 
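
Merely by way of illustration, the maskless stage of this embodiment
can be sketched as follows. The pairing of P1 with IR and P2 with UV
follows the example in the text. The monomer labels A through D, the
region names, and the function flood_cycle are hypothetical. The sketch
further assumes that each newly coupled monomer carries the same class
of protective group as the one just removed, so that the two sets of
regions stay distinguishable in later cycles.

```python
# Wavelength to which each protective group responds (example from the text).
SENSITIVITY = {"P1": "IR", "P2": "UV"}


def flood_cycle(surface, wavelength, monomer, new_group):
    """Expose the entire substrate, without a mask, at one wavelength.

    Only chains whose terminal protective group is sensitive to that
    wavelength are deprotected and coupled to the incoming monomer.
    """
    for chain in surface.values():
        if SENSITIVITY.get(chain[-1]) == wavelength:
            chain[-1:] = [monomer, new_group]  # swap group for monomer + group
    return surface


# State after the masked first layer: two sets of regions, two group types.
surface = {
    "first": ["S", "A", "P1"],
    "second": ["S", "B", "P2"],
}

flood_cycle(surface, "IR", "C", "P1")  # only the first regions react
flood_cycle(surface, "UV", "D", "P2")  # only the second regions react

print(surface["first"])   # ['S', 'A', 'C', 'P1']
print(surface["second"])  # ['S', 'B', 'D', 'P2']
```

In this sketch the choice of wavelength performs the selection that the
mask performed in the first layer.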

The polymers prepared on a substrate according to the 
above methods will have a variety of uses including, for 
example, screening for biological activity. In such screening 
activities, the substrate containing the sequences is exposed 
to an unlabeled or labeled receptor such as an antibody, 
receptor on a cell, phospholipid vesicle, or any one of a 
variety of other receptors. In one preferred embodiment the 
polymers are exposed to a first, unlabeled receptor of interest 
and, thereafter, exposed to a labeled receptor-specific rec- 
ognition element, which is, for example, an antibody. This 
process will provide signal amplification in the detection 
stage. 

The receptor molecules may bind with one or more 
polymers on the substrate. The presence of the labeled 
receptor and, therefore, the presence of a sequence which 
binds with the receptor is detected in a preferred embodi- 
ment through the use of autoradiography, detection of fluo- 
rescence with a charge-coupled device, fluorescence 
microscopy, or the like. The sequence of the polymer at the 
locations where the receptor binding is detected may be used 
to determine all or part of a sequence which is complemen- 
tary to the receptor. 

Use of the invention herein is illustrated primarily with
reference to screening for biological activity. The invention 
will, however, find many other uses. For example, the 
invention may be used in information storage (e.g., on 
optical disks), production of molecular electronic devices, 
production of stationary phases in separation sciences, pro- 
duction of dyes and brightening agents, photography, and in 
immobilization of cells, proteins, lectins, nucleic acids, 
polysaccharides and the like in patterns on a surface via 
molecular recognition of specific polymer sequences. By 
synthesizing the same compound in adjacent, progressively 
differing concentrations, a gradient will be established to 
control chemotaxis or to develop diagnostic dipsticks which, 
for example, titrate an antibody against an increasing 
amount of antigen. By synthesizing several catalyst mol- 
ecules in close proximity, more efficient multistep conver- 
sions may be achieved by "coordinate immobilization." 
Coordinate immobilization also may be used for electron 
transfer systems, as well as to provide both structural 
integrity and other desirable properties to materials such as 
lubrication, wetting, etc.

According to alternative embodiments, molecular
biodistribution or pharmacokinetic properties may be examined.
For example, to assess resistance to intestinal or serum 
proteases, polymers may be capped with a fluorescent tag 
and exposed to biological fluids of interest. 

III. Polymer Synthesis 

FIG. 1 illustrates one embodiment of the invention dis- 
closed herein in which a substrate 2 is shown in cross- 
section. Essentially, any conceivable substrate may be 
employed in the invention. The substrate may be biological, 
nonbiological, organic, inorganic, or a combination of any of 
these, existing as particles, strands, precipitates, gels, sheets, 
tubing, spheres, containers, capillaries, pads, slices, films, 
plates, slides, etc. The substrate may have any convenient 
shape, such as a disc, square, sphere, circle, etc. The
substrate is preferably flat but may take on a variety of
alternative surface configurations. For example, the
substrate may contain raised or depressed regions on which the
synthesis takes place. The substrate and its surface
preferably form a rigid support on which to carry out the reactions
described herein. The substrate and its surface are also chosen
to provide appropriate light-absorbing characteristics. For
instance, the substrate may be a polymerized Langmuir
Blodgett film, functionalized glass, Si, Ge, GaAs, GaP, SiO2,
SiN4, modified silicon, or any one of a wide variety of gels
or polymers such as (poly)tetrafluoroethylene,
(poly)vinylidenedifluoride, polystyrene, polycarbonate, or
combinations thereof. Other substrate materials will be readily
apparent to those of skill in the art upon review of this
disclosure. In a preferred embodiment the substrate is flat
glass or single-crystal silicon with surface relief features of
less than 10 Å.

According to some embodiments, the surface of the 
substrate is etched using well known techniques to provide 
for desired surface features. For example, by way of the
formation of trenches, v-grooves, mesa structures, or the
like, the synthesis regions may be more closely placed
within the focus point of impinging light, be provided with
reflective "mirror" structures for maximization of light
collection from fluorescent sources, or the like.

Surfaces on the solid substrate will usually, though not 
always, be composed of the same material as the substrate. 
Thus, the surface may be composed of any of a wide variety 
of materials, for example, polymers, plastics, resins, 
polysaccharides, silica or silica-based materials, carbon,
metals, inorganic glasses, membranes, or any of the
above-listed substrate materials. In some embodiments the surface
may provide for the use of caged binding members which
are attached firmly to the surface of the substrate in accord
with the teaching of copending application Ser. No. 404,920,
previously incorporated herein by reference. Preferably, the
surface will contain reactive groups, which could be
carboxyl, amino, hydroxyl, or the like. Most preferably, the 
surface will be optically transparent and will have surface 
Si — OH functionalities, such as are found on silica surfaces. 

The surface 4 of the substrate is preferably provided with
a layer of linker molecules 6, although it will be understood
that the linker molecules are not required elements of the
invention. The linker molecules are preferably of sufficient
length to permit polymers in a completed substrate to
interact freely with molecules exposed to the substrate. The
linker molecules should be 6-50 atoms long to provide
sufficient exposure. The linker molecules may be, for
example, aryl acetylene, ethylene glycol oligomers
containing 2-10 monomer units, diamines, diacids, amino acids, or
combinations thereof. Other linker molecules may be used
in light of this disclosure.

According to alternative embodiments, the linker
molecules are selected based upon their hydrophilic/
hydrophobic properties to improve presentation of synthe- 
sized polymers to certain receptors. For example, in the case 
of a hydrophilic receptor, hydrophilic linker molecules will 
be preferred so as to permit the receptor to more closely 
approach the synthesized polymer. 

According to another alternative embodiment, linker mol- 
ecules are also provided with a photocleavable group at an 
intermediate position. The photocleavable group is
preferably cleavable at a wavelength different from the protective
group. This enables removal of the various polymers fol- 
lowing completion of the synthesis by way of exposure to 
the different wavelengths of light. 

The linker molecules can be attached to the substrate via
carbon-carbon bonds using, for example,
(poly)trifluorochloroethylene surfaces, or preferably, by siloxane
bonds (using, for example, glass or silicon oxide surfaces).
Siloxane bonds with the surface of the substrate may be 
formed in one embodiment via reactions of linker molecules 
bearing trichlorosilyl groups. The linker molecules may 
optionally be attached in an ordered array, i.e., as parts of the 
head groups in a polymerized Langmuir Blodgett film. In 
alternative embodiments, the linker molecules are adsorbed 
to the surface of the substrate. 

The linker molecules and monomers used herein are 
provided with a functional group to which is bound a
protective group. Preferably, the protective group is on the 
distal or terminal end of the linker molecule opposite the 
substrate. The protective group may be either a negative 
protective group (i.e., the protective group renders the linker 
molecules less reactive with a monomer upon exposure) or 
a positive protective group (i.e., the protective group renders 
the linker molecules more reactive with a monomer upon 
exposure). In the case of negative protective groups an 
additional step of reactivation will be required. In some 
embodiments, this will be done by heating. 

The protective group on the linker molecules may be 
selected from a wide variety of positive light-reactive groups 
preferably including nitro aromatic compounds such as 
o-nitrobenzyl derivatives or benzylsulfonyl. In a preferred
embodiment, 6-nitroveratryloxy-carbonyl (NVOC),
2-nitrobenzyloxycarbonyl (NBOC) or
α,α-dimethyl-dimethoxybenzyloxycarbonyl (DDZ) is used. In one
embodiment, a nitro aromatic compound containing a ben- 
zylic hydrogen ortho to the nitro group is used, i.e., a 
chemical of the form: 




where R1 is alkoxy, alkyl, halo, aryl, alkenyl, or hydrogen;
R2 is alkoxy, alkyl, halo, aryl, nitro, or hydrogen; R3 is
alkoxy, alkyl, halo, nitro, aryl, or hydrogen; R4 is alkoxy,
alkyl, hydrogen, aryl, halo, or nitro; and R5 is alkyl, alkynyl,
cyano, alkoxy, hydrogen, halo, aryl, or alkenyl. Other
materials which may be used include o-hydroxy-α-methyl
cinnamoyl derivatives. Photoremovable protective groups are
described in, for example, Patchornik, J. Am. Chem. Soc.
(1970) 92:6333 and Amit et al., J. Org. Chem. (1974)
39:192, both of which are incorporated herein by reference.

In an alternative embodiment the positive reactive group 
is activated for reaction with reagents in solution. For
example, a 5-bromo-7-nitro indoline group, when bound to 
a carbonyl, undergoes reaction upon exposure to light at 420 
nm. 

In a second alternative embodiment, the reactive group on 
the linker molecule is selected from a wide variety of 
negative light-reactive groups including a cinnamate group.

Alternatively, the reactive group is activated or deacti- 
vated by electron beam lithography, x-ray lithography, or 
any other radiation. Suitable reactive groups for electron 
beam lithography include sulfonyl. Other methods may be 
used including, for example, exposure to a current source. 
Other reactive groups and methods of activation may be 
used in light of this disclosure. 

As shown in FIG. 1, the linking molecules are preferably
exposed to, for example, light through a suitable mask 8
using photolithographic techniques of the type known in the
semiconductor industry and described in, for example, Sze,
VLSI Technology, McGraw-Hill (1983), and Mead et al.,
Introduction to VLSI Systems, Addison-Wesley (1980),
which are incorporated herein by reference for all purposes.
The light may be directed at either the surface containing the
protective groups or at the back of the substrate, so long as
the substrate is transparent to the wavelength of light needed
for removal of the protective groups. In the embodiment
shown in FIG. 1, light is directed at the surface of the
substrate containing the protective groups. FIG. 1 illustrates
the use of such masking techniques as they are applied to a
positive reactive group so as to activate linking molecules
and expose functional groups in areas 10a and 10b.

The mask 8 is in one embodiment a transparent support 
material selectively coated with a layer of opaque material. 
Portions of the opaque material are removed, leaving opaque 
material in the precise pattern desired on the substrate
surface. The mask is brought into close proximity with, 
imaged on, or brought directly into contact with the substrate 
surface as shown in FIG. 1. "Openings" in the mask corre- 
spond to locations on the substrate where it is desired to 
remove photoremovable protective groups from the sub- 
strate. Alignment may be performed using conventional 
alignment techniques in which alignment marks (not shown) 
are used to accurately overlay successive masks with pre- 
vious patterning steps, or more sophisticated techniques may 
be used. For example, interferometric techniques such as the 
one described in Flanders et al., "A New Interferometric 
Alignment Technique," App. Phys. Lett. (1977) 31:426-428, 
which is incorporated herein by reference, may be used. 

To enhance contrast of light applied to the substrate, it is
desirable to provide contrast enhancement materials
between the mask and the substrate according to some
embodiments. This contrast enhancement layer may
comprise a molecule which is decomposed by light such as
quinone diazide or a material which is transiently bleached at
the wavelength of interest. Transient bleaching of materials
will allow greater penetration where light is applied, thereby
enhancing contrast. Alternatively, contrast enhancement
may be provided by way of a cladded fiber optic bundle.

The light may be from a conventional incandescent
source, a laser, a laser diode, or the like. If non-collimated
sources of light are used it may be desirable to provide a
thick- or multi-layered mask to prevent spreading of the
light onto the substrate. It may, further, be desirable in some
embodiments to utilize groups which are sensitive to
different wavelengths to control synthesis. For example, by using
groups which are sensitive to different wavelengths, it is
possible to select branch positions in the synthesis of a
polymer or eliminate certain masking steps. Several reactive
groups along with their corresponding wavelengths for
deprotection are provided in Table 1.

TABLE 1

Group                                   Approximate
                                        Deprotection Wavelength

Nitroveratryloxy carbonyl (NVOC)        UV (300-400 nm)
Nitrobenzyloxy carbonyl (NBOC)          UV (300-350 nm)
Dimethyl dimethoxybenzyloxy carbonyl    UV (280-300 nm)
5-Bromo-7-nitroindolinyl                UV (420 nm)
o-Hydroxy-α-methyl cinnamoyl            UV (300-350 nm)
2-Oxymethylene anthraquinone            UV (350 nm)

While the invention is illustrated primarily herein by way
of the use of a mask to illuminate selected regions of the
substrate, other techniques may also be used. For example,
the substrate may be translated under a modulated laser or
diode light source. Such techniques are discussed in, for
example, U.S. Pat. No. 4,719,615 (Feyrer et al.), which is
incorporated herein by reference. In alternative
embodiments a laser galvanometric scanner is utilized. In other
embodiments, the synthesis may take place on or in contact
with a conventional liquid crystal (referred to herein as a
"light valve") or fiber optic light sources. By appropriately
modulating liquid crystals, light may be selectively
controlled so as to permit light to contact selected regions of the
substrate. Alternatively, synthesis may take place on the end
of a series of optical fibers to which light is selectively
applied. Other means of controlling the location of light
exposure will be apparent to those of skill in the art.

The substrate may be irradiated either in contact or not in
contact with a solution (not shown) and is, preferably,
irradiated in contact with a solution. The solution contains
reagents to prevent the by-products formed by irradiation
from interfering with synthesis of the polymer according to
some embodiments. Such by-products might include, for
example, carbon dioxide, nitrosocarbonyl compounds,
styrene derivatives, indole derivatives, and products of their
photochemical reactions. Alternatively, the solution may
contain reagents used to match the index of refraction of the
substrate. Reagents added to the solution may further
include, for example, acidic or basic buffers, thiols,
substituted hydrazines and hydroxylamines, reducing agents (e.g.,
NADH) or reagents known to react with a given functional
group (e.g., aryl nitroso + glyoxylic acid → aryl
formhydroxamate + CO2).

Either concurrently with or after the irradiation step, the
linker molecules are washed or otherwise contacted with a
first monomer, illustrated by "A" in regions 12a and 12b in
FIG. 2. The first monomer reacts with the activated
functional groups of the linkage molecules which have been
exposed to light. The first monomer, which is preferably an
amino acid, is also provided with a photoprotective group.
The photoprotective group on the monomer may be the same
as or different than the protective group used in the linkage
molecules, and may be selected from any of the
above-described protective groups. In one embodiment, the
protective group for the A monomer is selected from the group
NBOC and NVOC.

As shown in FIG. 3, the process of irradiating is thereafter
repeated, with a mask repositioned so as to remove linkage
protective groups and expose functional groups in regions
14a and 14b which are illustrated as being regions which
were protected in the previous masking step. As an
alternative to repositioning of the first mask, in many embodiments
a second mask will be utilized. In other alternative
embodiments, some steps may provide for illuminating a
common region in successive steps. As shown in FIG. 3, it
may be desirable to provide separation between irradiated
regions. For example, separation of about 1-5 μm may be
appropriate to account for alignment tolerances.

As shown in FIG. 4, the substrate is then exposed to a
second protected monomer "B," producing B regions 16a
and 16b. Thereafter, the substrate is again masked so as to
remove the protective groups and expose reactive groups on
A region 12a and B region 16b. The substrate is again
exposed to monomer B, resulting in the formation of the
structure shown in FIG. 6. The dimers B-A and B-B have
been produced on the substrate.

A subsequent series of masking and contacting steps
similar to those described above with A (not shown)
provides the structure shown in FIG. 7. The process provides all
possible dimers of B and A, i.e., B-A, A-B, A-A, and B-B.

The substrate, the area of synthesis, and the area for
synthesis of each individual polymer could be of any size or
shape. For example, squares, ellipsoids, rectangles,
triangles, circles, or portions thereof, along with irregular
geometric shapes, may be utilized. Duplicate synthesis areas
may also be applied to a single substrate for purposes of
redundancy.

In one embodiment the regions 12 and 16 on the substrate
will have a surface area of between about 1 cm^2 and 10^-10
cm^2. In some embodiments the regions 12 and 16 have areas
of less than about 10^-1 cm^2, 10^-2 cm^2, 10^-3 cm^2, 10^-4 cm^2,
10^-5 cm^2, 10^-6 cm^2, 10^-7 cm^2, 10^-8 cm^2, or 10^-10 cm^2. In
a preferred embodiment, the regions 12 and 16 are between
about 10x10 μm and 500x500 μm.

In some embodiments a single substrate supports more
than about 10 different monomer sequences and preferably
more than about 100 different monomer sequences, although
in some embodiments more than about 10³, 10⁴, 10⁵, 10⁶,
10⁷, or 10⁸ different sequences are provided on a substrate.
Of course, within a region of the substrate in which a 
monomer sequence is synthesized, it is preferred that the 
monomer sequence be substantially pure. In some 
embodiments, regions of the substrate contain polymer 
sequences which are at least about 1%, 5%, 10%, 15%, 20%, 
25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%, 80%, 90%, 
95%, 96%, 97%, 98%, or 99% pure. 

According to some embodiments, several sequences are 
intentionally provided within a single region so as to provide 
an initial screening for biological activity, after which mate- 
rials within regions exhibiting significant binding are further 
evaluated. 

IV. Details of One Embodiment of a Reactor System

FIG. 8A schematically illustrates a preferred embodiment 
of a reactor system 100 for synthesizing polymers on the 
prepared substrate in accordance with one aspect of the 
invention. The reactor system includes a body 102 with a 
cavity 104 on a surface thereof. In preferred embodiments
the cavity 104 is between about 50 and 1000 μm deep with
a depth of about 500 μm preferred.

The bottom of the cavity is preferably provided with an 
array of ridges 106 which extend both into the plane of the 
Figure and parallel to the plane of the Figure. The ridges are 
preferably about 50 to 200 μm deep and spaced at about 2
to 3 mm. The purpose of the ridges is to generate turbulent
flow for better mixing. The bottom surface of the cavity is 
preferably light absorbing so as to prevent reflection of 
impinging light. 

A substrate 112 is mounted above the cavity 104. The
substrate is provided along its bottom surface 114 with a
photoremovable protective group such as NVOC with or
without an intervening linker molecule. The substrate is
preferably transparent to a wide spectrum of light, but in
some embodiments is transparent only at a wavelength at
which the protective group may be removed (such as UV in
the case of NVOC). The substrate in some embodiments is
a conventional microscope glass slide or cover slip. The 
substrate is preferably as thin as possible, while still pro- 
viding adequate physical support. Preferably, the substrate is 
less than about 1 mm thick, more preferably less than 0.5 
mm thick, more preferably less than 0.1 mm thick, and most 
preferably less than 0.05 mm thick. In alternative preferred 
embodiments, the substrate is quartz or silicon. 

The substrate and the body serve to seal the cavity except 
for an inlet port 108 and an outlet port 110. The body and the 
substrate may be mated for sealing in some embodiments 
with one or more gaskets. According to a preferred 
embodiment, the body is provided with two concentric 
gaskets and the intervening space is held at vacuum to 
ensure mating of the substrate to the gaskets. 

Fluid is pumped through the inlet port into the cavity by 
way of a pump 116 which may be, for example, a model no. 
B-120-S made by Eldex Laboratories. Selected fluids are 
circulated into the cavity by the pump, through the cavity, 
and out the outlet for recirculation or disposal. The reactor 
may be subjected to ultrasonic radiation and/or heated to aid 
in agitation in some embodiments.
Above the substrate 112, a lens 120 is provided which
may be, for example, a 2" 100 mm focal length fused silica 
lens. For the sake of a compact system, a reflective mirror 
122 may be provided for directing light from a light source 
124 onto the substrate. Light source 124 may be, for 
example, a Xe(Hg) light source manufactured by Oriel and 
having model no. 66024. A second lens 126 may be provided 
for the purpose of projecting a mask image onto the substrate 
in combination with lens 120. This form of lithography is
referred to herein as projection printing. As will be apparent 
from this disclosure, proximity printing and the like may
also be used according to some embodiments. 

Light from the light source is permitted to reach only 
selected locations on the substrate as a result of mask 128. 
Mask 128 may be, for example, a glass slide having etched 
chrome thereon. The mask 128 in one embodiment is
provided with a grid of transparent locations and opaque
locations. Such masks may be manufactured by, for
example, Photo Sciences, Inc. Light passes freely through
the transparent regions of the mask, but is reflected from or
absorbed by other regions. Therefore, only selected regions
of the substrate are exposed to light. 

As discussed above, light valves (LCD's) may be used as 
an alternative to conventional masks to selectively expose 
regions of the substrate. Alternatively, fiber optic faceplates 
such as those available from Schott Glass, Inc., may be used
for the purpose of contrast enhancement of the mask or as
the sole means of restricting the region to which light is
applied. Such faceplates would be placed directly above or
on the substrate in the reactor shown in FIG. 8A. In still
further embodiments, fly's-eye lenses, tapered fiber optic
faceplates, or the like, may be used for contrast enhancement.

In order to provide for illumination of regions smaller 
than a wavelength of light, more elaborate techniques may 
be utilized. For example, according to one preferred 
embodiment, light is directed at the substrate by way of 
molecular microcrystals on the tip of, for example, micropi- 
pettes. Such devices are disclosed in Lieberman et al., "A 
Light Source Smaller Than the Optical Wavelength," Sci- 
ence (1990) 247:59-61, which is incorporated herein by 
reference for all purposes. 

In operation, the substrate is placed on the cavity and 
sealed thereto. All operations in the process of preparing the 
substrate are carried out in a room lit primarily or entirely by 
light of a wavelength outside of the light range at which the 
protective group is removed. For example, in the case of 
NVOC, the room should be lit with a conventional dark 
room light which provides little or no UV light. All opera- 
tions are preferably conducted at about room temperature. 

A first, deprotection fluid (without a monomer) is circulated
through the cavity. The solution preferably is of 5 mM
sulfuric acid in dioxane solution which serves to keep 
exposed amino groups protonated and decreases their reac- 
tivity with photolysis by-products. Absorptive materials 
such as N,N-diethylamino 2,4-dinitrobenzene, for example, 
may be included in the deprotection fluid which serves to 
absorb light and prevent reflection and unwanted photolysis. 

The slide is, thereafter, positioned in a light raypath from 
the mask such that first locations on the substrate are 
illuminated and, therefore, deprotected. In preferred 
embodiments the substrate is illuminated for between about 
1 and 15 minutes with a preferred illumination time of about
10 minutes at 10-20 mW/cm² with 365 nm light. The slides
are neutralized (i.e., brought to a pH of about 7) after
photolysis with, for example, a solution of
di-isopropylethylamine (DIEA) in methylene chloride for
about 5 minutes.
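
As a rough check on the illumination conditions quoted above, the energy delivered per unit area is the irradiance multiplied by the exposure time. The short Python sketch below is illustrative only: the function name is hypothetical, and it assumes the irradiance is constant over the whole exposure.

```python
def exposure_dose_j_per_cm2(irradiance_mw_per_cm2, minutes):
    """Energy per unit area in J/cm^2 for a constant irradiance.

    dose = irradiance (W/cm^2) x time (s)
    """
    watts_per_cm2 = irradiance_mw_per_cm2 / 1000.0  # mW -> W
    seconds = minutes * 60.0                        # min -> s
    return watts_per_cm2 * seconds


# Preferred conditions from the text: about 10 minutes at 10-20 mW/cm^2.
low = exposure_dose_j_per_cm2(10, 10)
high = exposure_dose_j_per_cm2(20, 10)
print(f"dose range: {low:.1f} to {high:.1f} J/cm^2")
```

Under these assumptions the preferred 10 minute exposure corresponds to roughly 6 to 12 J/cm² at 365 nm.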

The first monomer is then placed at the first locations on 
the substrate. After irradiation, the slide is removed, treated
in bulk, and then reinstalled in the flow cell. Alternatively,
a fluid containing the first monomer, preferably also pro- 
tected by a protective group, is circulated through the cavity 
by way of pump 116. If, for example, it is desired to attach 
the amino acid Y to the substrate at the first locations, the 
amino acid Y (bearing a protective group on its α-nitrogen),
along with reagents used to render the monomer reactive, 
and/or a carrier, is circulated from a storage container 118, 
through the pump, through the cavity, and back to the inlet 
of the pump. 

The monomer carrier solution is, in a preferred 
embodiment, formed by mixing of a first solution (referred 
to herein as solution "A") and a second solution (referred to 
herein as solution "B"). Table 2 provides an illustration of a 
mixture which may be used for solution A. 

TABLE 2 

Representative Monomer Carrier Solution "A" 

100 mg NVOC amino protected amino acid 
37 mg HOBT (1-Hydroxybenzotriazole)
250 μl DMF (Dimethylformamide)
86 μl DIEA (Diisopropylethylamine)

The composition of solution B is illustrated in Table 3.
Solutions A and B are mixed and allowed to react at room
temperature for about 8 minutes, then diluted with 2 ml of
DMF, and 500 μl are applied to the surface of the slide or the
solution is circulated through the reactor system and allowed
to react for about 2 hours at room temperature. The slide is
then washed with DMF, methylene chloride and ethanol.

TABLE 3 

Representative Monomer Carrier Solution "B"

250 μl DMF
111 mg BOP (Benzotriazolyl-n-oxy-tris(dimethylamino)
phosphonium hexafluorophosphate)



As the solution containing the monomer to be attached is 
circulated through the cavity, the amino acid or other mono- 
mer will react at its carboxy terminus with amino groups on 
the regions of the substrate which have been deprotected. Of 
course, while the invention is illustrated by way of circula- 
tion of the monomer through the cavity, the invention could 
be practiced by way of removing the slide from the reactor 
and submersing it in an appropriate monomer solution. 

After addition of the first monomer, the solution contain- 
ing the first amino acid is then purged from the system. After 
circulation of a sufficient amount of the DMF/methylene 
chloride such that removal of the amino acid can be assured 
(e.g., about 50 times the volume of the cavity and carrier
lines), the mask or substrate is repositioned, or a new mask
is utilized such that second regions on the substrate will be 
exposed to light and the light 124 is engaged for a second 
exposure. This will deprotect second regions on the substrate 
and the process is repeated until the desired polymer 
sequences have been synthesized. 

The entire derivatized substrate is then exposed to a 
receptor of interest, preferably labeled with, for example, a 
fluorescent marker, by circulation of a solution or suspen- 
sion of the receptor through the cavity or by contacting the 
surface of the slide in bulk. The receptor will preferentially 
bind to certain regions of the substrate which contain 
complementary sequences. 

Antibodies are typically suspended in what is commonly 
referred to as "supercocktail," which may be, for example,
a solution of about 1% BSA (bovine serum albumin), 0.5%
Tween in PBS (phosphate buffered saline) buffer. The antibodies
are diluted into the supercocktail buffer to a final
concentration of, for example, about 0.1 to 4 μg/ml.

FIG. 8B illustrates an alternative preferred embodiment of
the reactor shown in FIG. 8A. According to this
embodiment, the mask 128 is placed directly in contact with 
the substrate. Preferably, the etched portion of the mask is 
placed face down so as to reduce the effects of light 
dispersion. According to this embodiment, the imaging 
lenses 120 and 126 are not necessary because the mask is 
brought into close proximity with the substrate. 

For purposes of increasing the signal-to-noise ratio of the 
technique, some embodiments of the invention provide for 
exposure of the substrate to a first labeled or unlabeled 
receptor followed by exposure of a labeled, second receptor 
(e.g., an antibody) which binds at multiple sites on the first 
receptor. If, for example, the first receptor is an antibody 
derived from a first species of an animal, the second receptor 
is an antibody derived from a second species directed to 
epitopes associated with the first species. In the case of a 
mouse antibody, for example, fluorescently labeled goat
antibody or antiserum which is antimouse may be used to 
bind at multiple sites on the mouse antibody, providing 
several times the fluorescence compared to the attachment of 
a single mouse antibody at each binding site. This process 
may be repeated again with additional antibodies (e.g., 
goat-mouse-goat, etc.) for further signal amplification. 

In preferred embodiments an ordered sequence of masks 
is utilized. In some embodiments it is possible to use as few 
as a single mask to synthesize all of the possible polymers 
of a given monomer set. 

If, for example, it is desired to synthesize all 16 dinucleotides
from four bases, a 1 cm square synthesis region is
divided conceptually into 16 boxes, each 0.25 cm wide. 
Denote the four monomer units by A, B, C, and D. The first 
reactions are carried out in four vertical columns, each 0.25 
cm wide. The first mask exposes the left-most column of 
boxes, where A is coupled. The second mask exposes the 
next column, where B is coupled; followed by a third mask, 
for the C column; and a final mask that exposes the right- 
most column, for D. The first, second, third, and fourth 
masks may be a single mask translated to different locations. 

The process is repeated in the horizontal direction for the 
second unit of the dimer. This time, the masks allow 
exposure of horizontal rows, again 0.25 cm wide. A, B, C, 
and D are sequentially coupled using masks that expose 
horizontal fourths of the reaction area. The resulting sub- 
strate contains all 16 dinucleotides of four bases. 
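
The column-then-row masking scheme just described can be sketched in a few lines of Python. This is an illustrative model only, not part of the original disclosure: the function name and the grid representation (a list of rows of strings) are assumptions, and each string records the monomers in the order they were coupled to that box.

```python
def synthesize_dimers(monomers=("A", "B", "C", "D")):
    """Model the 4 x 4 synthesis region as a grid of growing sequences."""
    n = len(monomers)
    grid = [["" for _ in range(n)] for _ in range(n)]
    masks_used = 0

    # First unit: one mask per vertical column, left to right.
    for col, monomer in enumerate(monomers):
        for row in range(n):
            grid[row][col] += monomer
        masks_used += 1

    # Second unit: one mask per horizontal row, top to bottom.
    for row, monomer in enumerate(monomers):
        for col in range(n):
            grid[row][col] += monomer
        masks_used += 1

    return grid, masks_used


grid, masks_used = synthesize_dimers()
for row in grid:
    print(" ".join(row))
print("masking steps:", masks_used)
```

Every box ends up with a distinct two-letter sequence, and the number of masking steps is eight, in agreement with the eight masks discussed in the text.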

The eight masks used to synthesize the dinucleotide are
related to one another by translation or rotation. In fact, one 
mask can be used in all eight steps if it is suitably rotated and 
translated. For example, in the example above, a mask with 
a single transparent region could be sequentially used to 
expose each of the vertical columns, rotated 90°, and then
sequentially used to allow exposure of the horizontal rows. 

Tables 4 and 5 provide a simple computer program in 
Quick Basic for planning a masking program and a sample 
output, respectively, for the synthesis of a polymer chain of 
three monomers ("residues") having three different mono- 
mers in the first level, four different monomers in the second 
level, and five different monomers in the third level in a 
striped pattern. The output of the program is the number of 
cells, the number of "stripes" (light regions) on each mask, 
and the amount of translation required for each exposure of 
the mask. 
TABLE 4

Mask Strategy Program

DEFINT A-Z
DIM b(20), w(20), l(500)
F$ = "LPT1:"
OPEN F$ FOR OUTPUT AS #1
jmax = 3                        'Number of residues
b(1) = 3: b(2) = 4: b(3) = 5    'Number of building blocks for res 1,2,3
g = 1: lmax(1) = 1
'g becomes the total number of cells (product of all building-block counts)
FOR j = 1 TO jmax: g = g * b(j): NEXT j
w(0) = 0: w(1) = g / b(1)
PRINT #1, "MASK2.BAS", DATE$, TIME$: PRINT #1,
PRINT #1, USING "Number of residues=##"; jmax
FOR j = 1 TO jmax
    PRINT #1, USING "   Residue ##   ## building blocks"; j; b(j)
NEXT j
PRINT #1, ""
PRINT #1, USING "Number of cells=####"; g: PRINT #1,
'Number of stripes and stripe width for each later mask
FOR j = 2 TO jmax
    lmax(j) = lmax(j - 1) * b(j - 1)
    w(j) = w(j - 1) / b(j)
NEXT j
FOR j = 1 TO jmax
    PRINT #1, USING "Mask for residue ##"; j: PRINT #1,
    PRINT #1, USING "  Number of stripes=###"; lmax(j)
    PRINT #1, USING "  Width of each stripe=###"; w(j)
    FOR l = 1 TO lmax(j)
        a = 1 + (l - 1) * w(j - 1)
        ae = a + w(j) - 1
        PRINT #1, USING "  Stripe ## begins at location ### and ends at ###"; l; a; ae
    NEXT l
    PRINT #1,
    PRINT #1, USING "  For each of ## building blocks, translate mask by ## cell(s)"; b(j); w(j)
    PRINT #1, : PRINT #1, : PRINT #1,
NEXT j

© Copyright 1990, Affymax N.V.
TABLE 5

Masking Strategy Output

Number of residues= 3
   Residue  1    3 building blocks
   Residue  2    4 building blocks
   Residue  3    5 building blocks
Number of cells=  60

Mask for residue  1
  Number of stripes=  1
  Width of each stripe= 20
  Stripe  1 begins at location   1 and ends at  20
  For each of  3 building blocks, translate mask by 20 cell(s)

Mask for residue  2
  Number of stripes=  3
  Width of each stripe=  5
  Stripe  1 begins at location   1 and ends at   5
  Stripe  2 begins at location  21 and ends at  25
  Stripe  3 begins at location  41 and ends at  45
  For each of  4 building blocks, translate mask by  5 cell(s)

Mask for residue  3
  Number of stripes= 12
  Width of each stripe=  1
  Stripe  1 begins at location   1 and ends at   1
  Stripe  2 begins at location   6 and ends at   6
  Stripe  3 begins at location  11 and ends at  11
  Stripe  4 begins at location  16 and ends at  16
  Stripe  5 begins at location  21 and ends at  21
  Stripe  6 begins at location  26 and ends at  26
  Stripe  7 begins at location  31 and ends at  31
  Stripe  8 begins at location  36 and ends at  36
  Stripe  9 begins at location  41 and ends at  41
  Stripe 10 begins at location  46 and ends at  46
  Stripe 11 begins at location  51 and ends at  51
  Stripe 12 begins at location  56 and ends at  56
  For each of  5 building blocks, translate mask by  1 cell(s)

© Copyright 1990, Affymax N.V.
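
The arithmetic behind the striped masking plan can also be sketched in Python. The sketch below is an illustrative re-expression, not the original Quick Basic program: the function name `plan_masks` and the dictionary keys are hypothetical, and it assumes, as in the sample output, that the stripe width for each residue divides evenly.

```python
def plan_masks(blocks):
    """Return (number of cells, per-residue mask plans) for a striped layout.

    blocks[j] is the number of building blocks available at residue j + 1.
    """
    cells = 1
    for b in blocks:
        cells *= b

    plans = []
    stripes = 1      # the first mask has a single stripe
    period = 0       # spacing between stripe starts (width of the previous level)
    width = cells
    for residue, b in enumerate(blocks, start=1):
        width //= b  # each level splits the previous stripe width b ways
        starts = [1 + k * period for k in range(stripes)]
        plans.append({
            "residue": residue,
            "stripes": stripes,
            "width": width,
            "spans": [(s, s + width - 1) for s in starts],
            "translations": b,   # one exposure per building block
            "shift": width,      # cells to translate between exposures
        })
        stripes *= b
        period = width
    return cells, plans


cells, plans = plan_masks([3, 4, 5])
print("cells:", cells)
for p in plans:
    print(p["residue"], p["stripes"], p["width"], p["spans"][:3])
```

For building-block counts of 3, 4, and 5 this gives 60 cells and masks with 1, 3, and 12 stripes of width 20, 5, and 1, which is the pattern shown in the sample output.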
V. Details of One Embodiment of a Fluorescent Detection Device

FIG. 9 illustrates a fluorescent detection device for detecting
fluorescently labeled receptors on a substrate. A substrate
112 is placed on an x/y translation table 202. In a
preferred embodiment the x/y translation table is a model no.
PM500-A1 manufactured by Newport Corporation. The x/y
translation table is connected to and controlled by an appropriately
programmed digital computer 204 which may be,
for example, an appropriately programmed IBM PC/AT or
AT compatible computer. Of course, other computer
systems, special purpose hardware, or the like could readily
be substituted for the AT computer used herein for illustration.
Computer software for the translation and data collection
functions described herein can be provided based on
commercially available software including, for example,
"Lab Windows" licensed by National Instruments, which is
incorporated herein by reference for all purposes.

The substrate and x/y translation table are placed under a 
microscope 206 which includes one or more objectives 208. 
Light (about 488 nm) from a laser 210, which in some 
embodiments is a model no. 2020-05 argon ion laser manu- 
factured by Spectraphysics, is directed at the substrate by a 
dichroic mirror 207 which passes greater than about 520 nm 
light but reflects 488 nm light. Dichroic mirror 207 may be, 
for example, a model no. FT510 manufactured by Carl 
Zeiss. Light reflected from the mirror then enters the micro- 
scope 206 which may be, for example, a model no. Axioscop 
20 manufactured by Carl Zeiss. Fluorescein-marked materials
on the substrate will fluoresce >488 nm light, and the
fluoresced light will be collected by the microscope and 
passed through the mirror. The fluorescent light from the 
substrate is then directed through a wavelength filter 209 
and, thereafter through an aperture plate 211. Wavelength 
filter 209 may be, for example, a model no. OG530 manu- 
factured by Melles Griot and aperture plate 211 may be, for 
example, a model no. 477352/477380 manufactured by Carl 
Zeiss. 

The fluoresced light then enters a photomultiplier tube 
212 which in some embodiments is a model no. R943-02 
manufactured by Hamamatsu; the signal is amplified in
preamplifier 214 and photons are counted by photon counter 
216. The number of photons is recorded as a function of the 
location in the computer 204. Pre-Amp 214 may be, for 
example, a model no. SR440 manufactured by Stanford 
Research Systems and photon counter 216 may be a model 
no. SR400 manufactured by Stanford Research Systems. 
The substrate is then moved to a subsequent location and the 
process is repeated. In preferred embodiments the data are 
acquired every 1 to 100 μm with a data collection diameter
of about 0.8 to 10 μm preferred. In embodiments with
sufficiently high fluorescence, a CCD detector with broad- 
field illumination is utilized. 

By counting the number of photons generated in a given 
area in response to the laser, it is possible to determine where
fluorescent marked molecules are located on the substrate.
Consequently, for a slide which has a matrix of polypeptides, 
for example, synthesized on the surface thereof, it is possible 
to determine which of the polypeptides is complementary to 
a fluorescently marked receptor. 
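
The mapping from recorded photon counts to labeled locations can be sketched as follows. This is a minimal illustration under stated assumptions, not the software used with the apparatus: the function name, the dictionary of counts keyed by stage position, and the rule of flagging positions whose count exceeds a multiple of the background are all hypothetical.

```python
def bright_positions(counts, background, factor=3.0):
    """Return stage positions whose photon count exceeds factor x background.

    counts maps (x, y) stage positions to the photon count recorded there.
    """
    threshold = factor * background
    return sorted(pos for pos, n in counts.items() if n > threshold)


# Hypothetical counts recorded at four stage positions (micrometers).
counts = {(0, 0): 12, (0, 100): 950, (100, 0): 15, (100, 100): 40}
print(bright_positions(counts, background=10))
```

Positions returned by such a rule would then be matched against the known synthesis layout to identify which sequences bound the labeled receptor.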

According to preferred embodiments, the intensity and 
duration of the light applied to the substrate is controlled by 
varying the laser power and scan stage rate for improved 
signal-to-noise ratio by maximizing fluorescence emission 
and minimizing background noise. 

While the detection apparatus has been illustrated prima- 
rily herein with regard to the detection of marked receptors, 
the invention will find application in other areas. For 
example, the detection apparatus disclosed herein could be 
used in the fields of catalysis, DNA or protein gel scanning, 
and the like. 

VI. Determination of Relative Binding Strength of Receptors

The signal-to-noise ratio of the present invention is suf- 
ficiently high that not only can the presence or absence of a 
receptor on a ligand be detected, but also the relative binding
affinity of receptors to a variety of sequences can be deter- 
mined. 

In practice it is found that a receptor will bind to several 
peptide sequences in an array, but will bind much more 
strongly to some sequences than others. Strong binding 
affinity will be evidenced herein by a strong fluorescent or 
radiographic signal since many receptor molecules will bind 
in a region of a strongly bound ligand. Conversely, a weak
binding affinity will be evidenced by a weak fluorescent or
radiographic signal due to the relatively small number of
receptor molecules which bind in a particular region of a
substrate having a ligand with a weak binding affinity for the
receptor. Consequently, it becomes possible to determine
relative binding avidity (or affinity in the case of univalent
interactions) of a ligand herein by way of the intensity of a
fluorescent or radiographic signal in a region containing that 
ligand. 

Semiquantitative data on affinities might also be obtained 
by varying washing conditions and concentrations of the
receptor. This would be done by comparison to known 
ligand receptor pairs, for example. 

VII. Examples 

The following examples are provided to illustrate the
efficacy of the inventions herein. All operations were con- 
ducted at about ambient temperatures and pressures unless 
indicated to the contrary. 
A. Slide Preparation 
Before attachment of reactive groups it is preferred to
clean the substrate which is, in a preferred embodiment, a
glass substrate such as a microscope slide or cover slip.
According to one embodiment the slide is soaked in an
alkaline bath consisting of, for example, 1 liter of 95%
ethanol with 120 ml of water and 120 grams of sodium
hydroxide for 12 hours. The slides are then washed under
running water and allowed to air dry, and rinsed once with
a solution of 95% ethanol.

The slides are then aminated with, for example, aminopropyltriethoxysilane
for the purpose of attaching amino
groups to the glass surface on linker molecules, although any
omega functionalized silane could also be used for this
purpose. In one embodiment 0.1% aminopropyltriethoxysilane
is utilized, although solutions with concentrations from
10⁻⁷% to 10% may be used, with about 10⁻³% to 2%
preferred. A 0.1% mixture is prepared by adding to 100 ml
of a 95% ethanol/5% water mixture, 100 microliters (μl) of
aminopropyltriethoxysilane. The mixture is agitated at about
ambient temperature on a rotary shaker for about 5 minutes.
500 μl of this mixture is then applied to the surface of one
side of each cleaned slide. After 4 minutes, the slides are
decanted of this solution and rinsed three times by dipping
in, for example, 100% ethanol.

After the plates dry, they are placed in a 110-120° C. 
vacuum oven for about 20 minutes, and then allowed to cure 
at room temperature for about 12 hours in an argon environment.
The slides are then dipped into DMF
(dimethylformamide) solution, followed by a thorough
washing with methylene chloride.

The aminated surface of the slide is then exposed to about 
500 μl of, for example, a 30 millimolar (mM) solution of
NVOC-GABA (gamma amino butyric acid) NHS
(N-hydroxysuccinimide) in DMF for attachment of a
NVOC-GABA to each of the amino groups.

The surface is washed with, for example, DMF, methyl- 
ene chloride, and ethanol. 

Any unreacted aminopropyl silane on the surface (that is,
those amino groups which have not had the NVOC-GABA
attached) is now capped with acetyl groups, to prevent
further reaction, by exposure to a 1:3 mixture of acetic
anhydride in pyridine for 1 hour. Other materials which may
perform this residual capping function include trifluoroacetic
anhydride, formicacetic anhydride, or other reactive
acylating agents. Finally, the slides are washed again with
DMF, methylene chloride, and ethanol.
B. Synthesis of Eight Trimers of "A" and "B" 

FIG. 10 illustrates a possible synthesis of the eight trimers 
of the two-monomer set: gly, phe (represented by "A" and
"B," respectively). A glass slide bearing silane groups terminating
in 6-nitroveratryloxycarboxamide (NVOC-NH)
residues is prepared as a substrate. Active esters
(pentafluorophenyl, OBt, etc.) of gly and phe protected at the
amino group with NVOC are prepared as reagents. While
not pertinent to this example, if side chain protecting groups
are required for the monomer set, these must not be photoreactive
at the wavelength of light used to deprotect the
primary chain.

For a monomer set of size n, n×l cycles are required to
synthesize all possible sequences of length l. A cycle consists of:

1. Irradiation through an appropriate mask to expose the 
amino groups at the sites where the next residue is to be 
added, with appropriate washes to remove the 
by-products of the deprotection. 

2. Addition of a single activated and protected (with the
same photochemically-removable group) monomer, 
which will react only at the sites addressed in step 1, 
with appropriate washes to remove the excess reagent 
from the surface. 

The above cycle is repeated for each member of the
monomer set until each location on the surface has been
extended by one residue in one embodiment. In other
embodiments, several residues are sequentially added at one
location before moving on to the next location. Cycle times 
will generally be limited by the coupling reaction rate, now 
as short as 20 min in automated peptide synthesizers. This 
step is optionally followed by addition of a protecting group 
to stabilize the array for later testing. For some types of 
polymers (e.g., peptides), a final deprotection of the entire 
surface (removal of photoprotective side chain groups) may
be required.

More particularly, as shown in FIG. 10A, the glass 20 is 
provided with regions 22, 24, 26, 28, 30, 32, 34, and 36. 
Regions 30, 32, 34, and 36 are masked, as shown in FIG. 
10B and the glass is irradiated and exposed to a reagent 
containing "A" (e.g., gly), with the resulting structure shown 
in FIG. 10C. Thereafter, regions 22, 24, 26, and 28 are
masked, the glass is irradiated (as shown in FIG. 10D) and 
exposed to a reagent containing "B" (e.g., phe), with the 
resulting structure shown in FIG. 10E. The process 
proceeds, consecutively masking and exposing the sections 
as shown until the structure shown in FIG. 10M is obtained. 
The glass is irradiated and the terminal groups are, 
optionally, capped by acetylation. As shown, all possible 
trimers of gly/phe are obtained. 
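
The cycle structure of this example can be modeled in Python. The sketch is illustrative only: the function name and the way regions are indexed are assumptions, with "A" standing for gly and "B" for phe as in the text. Each cycle corresponds to one mask plus one monomer addition, and the mask for a given layer exposes the regions whose index has the matching digit in base n.

```python
from itertools import product


def synthesize_all(monomers, length):
    """Build every sequence of the given length, one mask/monomer cycle at a time."""
    n = len(monomers)
    regions = [""] * (n ** length)
    cycles = 0
    for layer in range(length):
        block = n ** (length - 1 - layer)  # size of a run of equal digits
        for idx, monomer in enumerate(monomers):
            # Mask: expose only regions whose digit for this layer equals idx.
            exposed = [r for r in range(len(regions)) if (r // block) % n == idx]
            for r in exposed:
                regions[r] += monomer
            cycles += 1
    return regions, cycles


regions, cycles = synthesize_all(("A", "B"), 3)
print(regions)
print("cycles:", cycles)
```

With two monomers and three layers the model uses six cycles and yields eight distinct trimers, matching the n×l cycle count stated above.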

In this example, no side chain protective group removal is 
necessary. If it is desired, side chain deprotection may be 
accomplished by treatment with ethanedithiol and trifluoro- 
acetic acid. 

In general, the number of steps needed to obtain a
particular polymer chain is defined by:

$n \times l$   (1)

where:

n = the number of monomers in the basis set of monomers,
and

l = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of
length l will be:

$n^{l}$   (2)

Of course, greater diversity is obtained by using masking
strategies which will also include the synthesis of polymers
having a length of less than l. If, in the extreme case, all
polymers having a length less than or equal to l are
synthesized, the number of polymers synthesized will be:

$n^{l} + n^{l-1} + \cdots + n^{1}$   (3)



The maximum number of lithographic steps needed will 
generally be n for each "layer" of monomers, i.e., the total 
number of masks (and, therefore, the number of lithographic 
steps) needed will be n×l. The size of the transparent mask
regions will vary in accordance with the area of the substrate 
available for synthesis and the number of sequences to be 
formed. In general, the size of the synthesis areas will be: 

size of synthesis areas = (A)/(S)

where: 

A is the total area available for synthesis; and 
S is the number of sequences desired in the area. 
It will be appreciated by those of skill in the art that the 
above method could readily be used to simultaneously 
produce thousands or millions of oligomers on a substrate 
using the photolithographic techniques disclosed herein. 
Consequently, the method results in the ability to practically 
test large numbers of, for example, di, tri, tetra, penta, hexa, 
hepta, octapeptides, dodecapeptides, or larger polypeptides 
(or correspondingly, polynucleotides). 

The above example has illustrated the method by way of 
a manual example. It will of course be appreciated that 
automated or semi-automated methods could be used. The 
substrate would be mounted in a flow cell for automated 
addition and removal of reagents, to minimize the volume of 
reagents needed, and to more carefully control reaction 
conditions. Successive masks could be applied manually or 
automatically. 

C. Synthesis of a Dimer of an Aminopropyl Group and a 
Fluorescent Group 

In synthesizing the dimer of an aminopropyl group and a 
fluorescent group, a functionalized Durapore membrane was 
used as a substrate. The Durapore membrane was a 
polyvinylidene difluoride with aminopropyl groups. The 
aminopropyl groups were protected with the DDZ group by 
reaction of the carbonyl chloride with the amino groups, a 
reaction readily known to those of skill in the art. The 
surface bearing these groups was placed in a solution of THF 
and contacted with a mask bearing a checkerboard pattern of 
1 mm opaque and transparent regions. The mask was 
exposed to ultraviolet light having a wavelength down to at 
least about 280 nm for about 5 minutes at ambient 
temperature, although a wide range of exposure times and 
temperatures may be appropriate in various embodiments of 
the invention. For example, in one embodiment, an exposure 
time of between about 1 and 5000 seconds may be used at 
process temperatures of between -70 and +50° C. 

In one preferred embodiment, exposure times of between 
about 1 and 500 seconds at about ambient pressure are used. 
In some preferred embodiments, pressure above ambient is 
used to prevent evaporation. 

The surface of the membrane was then washed for about 
1 hour with a fluorescent label which included an active ester 
bound to a chelate of a lanthanide. Wash times will vary over 
a wide range of values from about a few minutes to a few 
hours. These materials fluoresce in the red and the green 
visible region. After the reaction with the active ester in the 
fluorophore was complete, the locations in which the fluo- 
rophore was bound could be visualized by exposing them to 
ultraviolet light and observing the red and the green fluo- 
rescence. It was observed that the derivatized regions of the 
substrate closely corresponded to the original pattern of the 
mask. 

D. Demonstration of Signal Capability 

Signal detection capability was demonstrated using a 
low-level standard fluorescent bead kit manufactured by 
Flow Cytometry Standards and having model no. 824. This 
kit includes 5.8 μm diameter beads, each impregnated with 
a known number of fluorescein molecules. 

One of the beads was placed in the illumination field on 
the scan stage as shown in FIG. 9 in a field of a laser spot 
which was initially shuttered. After being positioned in the 
illumination field, the photon detection equipment was 
turned on. The laser beam was unblocked and it interacted 
with the particle bead, which then fluoresced. Fluorescence 
curves of beads impregnated with 7,000; 13,000; and 29,000 
fluorescein molecules are shown in FIGS. 11A, 11B, and 
11C, respectively. On each curve, traces for beads without 
fluorescein molecules are also shown. These experiments 
were performed with 488 nm excitation, with 100 of 
laser power. The light was focused through a 40 power 0.75 
NA objective. 

The fluorescence intensity in all cases started off at a high 
value and then decreased exponentially. The fall-off in 
intensity is due to photobleaching of the fluorescein molecules. 
The traces of beads without fluorescein molecules 
are used for background subtraction. The difference in the 
initial exponential decay between labeled and nonlabeled 
beads is integrated to give the total number of photon counts, 
and this number is related to the number of molecules per 
bead. Therefore, it is possible to deduce the number of 
photons per fluorescein molecule that can be detected. For 
the curves illustrated in FIG. 11, this calculation indicates 
that about 40 to 50 photons per fluorescein 
molecule are detected. 
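
The background-subtraction and integration procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually used: the traces are taken to be lists of photon counts per time bin, the function names are hypothetical, and the synthetic decay parameters are invented solely so the example runs.

```python
import math


def integrated_signal(labeled, blank):
    """Sum of background-subtracted counts over all time bins."""
    if len(labeled) != len(blank):
        raise ValueError("traces must have the same number of time bins")
    return sum(a - b for a, b in zip(labeled, blank))


def photons_per_molecule(labeled, blank, molecules_per_bead):
    """Detected photons per fluorescein molecule for one bead."""
    return integrated_signal(labeled, blank) / molecules_per_bead


def synthetic_trace(initial_counts, decay_bins, n_bins=200, background=50.0):
    """Hypothetical exponentially photobleaching trace on a flat background."""
    return [background + initial_counts * math.exp(-t / decay_bins)
            for t in range(n_bins)]


# Hypothetical data: a bead carrying 7,000 fluorescein molecules and a blank bead.
blank_trace = synthetic_trace(0.0, 20.0)
labeled_trace = synthetic_trace(15_000.0, 20.0)
print(photons_per_molecule(labeled_trace, blank_trace, 7_000))
```

Because the blank-bead trace is subtracted bin by bin, a constant background contributes nothing to the integrated signal; only the decaying fluorescence component is counted.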

E. Determination of the Number of Molecules Per Unit Area 

Aminopropylated glass microscope slides prepared 
according to the methods discussed above were utilized in 
order to establish the density of labeling of the slides. The 
free amino termini of the slides were reacted with FITC 
(fluorescein isothiocyanate) which forms a covalent linkage 
with the amino group. The slide is then scanned to count the 
number of fluorescent photons generated in a region which, 
using the estimated 40-50 photons per fluorescent molecule, 
enables the calculation of the number of molecules which 
are on the surface per unit area. 

A slide with aminopropyl silane on its surface was 
immersed in a 1 mM solution of FITC in DMF for 1 hour at 
about ambient temperature. After reaction, the slide was 
washed twice with DMF and then washed with ethanol, 
water, and then ethanol again. It was then dried and stored 
in the dark until it was ready to be examined. 

Through the use of curves similar to those shown in FIG. 
11, and by integrating the fluorescent counts under the 
exponentially decaying signal, the number of free amino 
groups on the surface after derivatization was determined. It 
was determined that slides with labeling densities of 1 
fluorescein per 10³ × 10³ to ~2 × 2 nm could be reproducibly 
made as the concentration of aminopropyltriethoxysilane 
varied from 10⁻⁵% to 10⁻¹%. 
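
The conversion from integrated photon counts to a surface density can be sketched as below. This is an illustration only: the function names are hypothetical, the value of 45 photons per molecule is simply the midpoint of the 40 to 50 range estimated above, the count and area in the first example are invented, and a density of "1 fluorescein per a × a nm" is assumed to mean one molecule per square of side a nanometers.

```python
def molecules_in_region(photon_counts: float, photons_per_molecule: float) -> float:
    """Number of fluorescent molecules implied by a background-subtracted count."""
    return photon_counts / photons_per_molecule


def density_per_um2(molecules: float, area_um2: float) -> float:
    """Molecules per square micrometer over a scanned region."""
    return molecules / area_um2


def density_from_spacing(side_nm: float) -> float:
    """Molecules per square micrometer if one molecule occupies a side_nm x side_nm square."""
    nm2_per_um2 = 1_000.0 ** 2
    return nm2_per_um2 / (side_nm ** 2)


# Hypothetical scan: 90,000 counts from a 100 square-micrometer region.
molecules = molecules_in_region(90_000, 45.0)
print(density_per_um2(molecules, 100.0))  # 20.0 molecules per square micrometer

# The two ends of the labeling-density range quoted above.
print(density_from_spacing(1_000.0))  # 1.0 molecule per square micrometer
print(density_from_spacing(2.0))      # 250000.0 molecules per square micrometer
```

Under that reading, the reported range spans roughly five to six orders of magnitude in surface density.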

F. Removal of NVOC and Attachment of a Fluorescent 
Marker 

NVOC-GABA groups were attached as described above. 
The entire surface of one slide was exposed to light so as to 
expose a free amino group at the end of the gamma amino 
butyric acid. This slide, and a duplicate which was not 
exposed, were then exposed to fluorescein isothiocyanate 
(FITC). 

FIG. 12A illustrates the slide which was not exposed to 
light, but which was exposed to FITC. The units of the x axis 
are time and the units of the y axis are counts. The trace 
contains a certain amount of background fluorescence. The 
duplicate slide was exposed to 350 nm broadband illumination 
for about 1 minute (12 mW/cm², ~350 nm 
illumination), washed and reacted with FITC. The fluorescence 
curves for this slide are shown in FIG. 12B. A large 
increase in the level of fluorescence is observed, which 
indicates photolysis has exposed a number of amino groups 
on the surface of the slides for attachment of a fluorescent 
marker. 

G. Use of a Mask in Removal of NVOC 

The next experiment was performed with a 0.1% aminopropylated 
slide. Light from a Hg-Xe arc lamp was imaged 
onto the substrate through a laser-ablated chrome-on-glass 
mask in direct contact with the substrate. 

This slide was illuminated for approximately 5 minutes, 
with 12 mW of 350 nm broadband light and then reacted 
with the 1 mM FITC solution. It was put on the laser 
detection scanning stage and a graph was plotted as a 
two-dimensional representation of position versus fluorescence 
intensity. The fluorescence intensity (in counts) as a 
function of location is given on the scale to the right of FIG. 
13A for a mask having 100×100 μm squares. 

The experiment was repeated a number of times through 
various masks. The fluorescence pattern for a 50 μm mask is 
illustrated in FIG. 13B, for a 20 μm mask in FIG. 13C, and 
for a 10 μm mask in FIG. 13D. The mask pattern is distinct 
down to at least about 10 μm squares using this lithographic 
technique. 

H. Attachment of YGGFL and Subsequent Exposure to Herz 
Antibody and Goat Antimouse 

In order to establish that receptors to a particular polypeptide 
sequence would bind to a surface-bound peptide and be 
detected, Leu enkephalin was coupled to the surface and 
recognized by an antibody. A slide was derivatized with 
0.1% aminopropyltriethoxysilane and protected with 
NVOC. A 500 μm checkerboard mask was used to expose 
the slide in a flow cell using backside contact printing. The 
Leu enkephalin sequence (H2N-tyrosine, glycine, glycine, 
phenylalanine, leucine-CO2H, otherwise referred to herein as 
YGGFL) was attached via its carboxy end to the exposed 
amino groups on the surface of the slide. The peptide was 
added in DMF solution with the BOP/HOBT/DIEA coupling 
reagents and recirculated through the flow cell for 2 
hours at room temperature. 

A first antibody, known as the Herz antibody, was applied 
to the surface of the slide for 45 minutes at 2 μg/ml in a 
supercocktail (containing 1% BSA and 1% ovalbumin also 
in this case). A second antibody, goat anti-mouse fluorescein 
conjugate, was then added at 2 μg/ml in the supercocktail 
buffer, and allowed to incubate for 2 hours. 

The results of this experiment are provided in FIG. 14. 
Again, this figure illustrates fluorescence intensity as a 
function of position. The fluorescence scale is shown on the 
right. This image was taken at 10 steps. This figure 
indicates that not only can deprotection be carried out in a 
well defined pattern, but also that (1) the method provides 
for successful coupling of peptides to the surface of the 
substrate, (2) the surface of a bound peptide is available for 
binding with an antibody, and (3) the detection apparatus 
capabilities are sufficient to detect binding of a receptor. 

I. Monomer-by-Monomer Formation of YGGFL and Subsequent 
Exposure to Labeled Antibody 

Monomer-by-monomer synthesis of YGGFL and GGFL 
in alternate squares was performed on a slide in a checkerboard 
pattern and the resulting slide was exposed to the Herz 
antibody. This experiment and the results thereof are illustrated 
in FIGS. 15A, 15B, 15C, and 15D. 

In FIG. 15A, a slide is shown which is derivatized with 
the aminopropyl group, protected in this case with t-BOC 
(t-butoxycarbonyl). The slide was treated with TFA to 
remove the t-BOC protecting group. ε-Aminocaproic acid, 
which was t-BOC protected at its amino group, was then 
coupled onto the aminopropyl groups. The aminocaproic 
acid serves as a spacer between the aminopropyl group and 
the peptide to be synthesized. The amino end of the spacer 
was deprotected and coupled to NVOC-leucine. The entire 
slide was then illuminated with 12 mW of 325 nm broadband 
illumination. The slide was then coupled with NVOC-phenylalanine 
and washed. The entire slide was again 
illuminated, then coupled to NVOC-glycine and washed. 
The slide was again illuminated and coupled to NVOC-glycine 
to form the sequence shown in the last portion of 
FIG. 15A. 

As shown in FIG. 15B, alternating regions of the slide 
were then illuminated using a projection print using a 
500×500 μm checkerboard mask; thus, the amino group of 
glycine was exposed only in the lighted areas. When the next 
coupling chemistry step was carried out, NVOC-tyrosine 
was added, and it coupled only at those spots which had 
received illumination. The entire slide was then illuminated 
to remove all the NVOC groups, leaving a checkerboard of 
YGGFL in the lighted areas and in the other areas, GGFL. 
The Herz antibody (which recognizes the YGGFL, but not 
GGFL) was then added, followed by goat anti-mouse fluorescein 
conjugate. 
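
The sequence of illumination and coupling steps just described can be modeled in a few lines. This is a toy sketch under stated assumptions, not a description of any apparatus: the slide is represented as a small grid of sites, each holding a one-letter sequence (written from the free amino end to the surface-bound carboxy end) and a flag for the photolabile protecting group, and all function names are hypothetical.

```python
def make_slide(rows, cols):
    """Grid of sites bearing a deprotected spacer (no monomers coupled yet)."""
    return [[{"seq": "", "protected": False} for _ in range(cols)]
            for _ in range(rows)]


def checkerboard(rows, cols):
    """Mask with transparent (True) and opaque (False) squares alternating."""
    return [[(r + c) % 2 == 0 for c in range(cols)] for r in range(rows)]


def illuminate(slide, mask=None):
    """Remove the protecting group wherever light reaches (mask=None: whole slide)."""
    for r, row in enumerate(slide):
        for c, site in enumerate(row):
            if mask is None or mask[r][c]:
                site["protected"] = False


def couple(slide, monomer):
    """Attach a protected monomer to the free amino end of every deprotected site."""
    for row in slide:
        for site in row:
            if not site["protected"]:
                site["seq"] = monomer + site["seq"]
                site["protected"] = True


slide = make_slide(4, 4)
couple(slide, "L")                      # leucine onto the spacer
for monomer in ("F", "G", "G"):         # whole-slide cycles build GGFL everywhere
    illuminate(slide)
    couple(slide, monomer)
illuminate(slide, checkerboard(4, 4))   # masked exposure
couple(slide, "Y")                      # tyrosine couples only in lighted squares
illuminate(slide)                       # final whole-slide deprotection

pattern = [[site["seq"] for site in row] for row in slide]
for row in pattern:
    print(row)
```

The printed grid alternates between YGGFL and GGFL, mirroring the checkerboard that the antibody is then used to read out.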

The resulting fluorescence scan is shown in FIG. 15C, and 
the scale for the fluorescence intensity is again given on the 
right. Dark areas contain the tetrapeptide GGFL, which is 
not recognized by the Herz antibody (and thus there is no 
binding of the goat anti-mouse antibody with fluorescein 
conjugate), and in the red areas YGGFL is present. The 
YGGFL pentapeptide is recognized by the Herz antibody 
and, therefore, there is antibody in the lighted regions for the 
fluorescein-conjugated goat anti-mouse to recognize. 

Similar patterns are shown for a 50 μm mask used in 
direct contact ("proximity print") with the substrate in FIG. 
15D. Note that the pattern is more distinct and the corners 
of the checkerboard pattern are touching when the mask is 
placed in direct contact with the substrate (which reflects the 
increase in resolution using this technique). 

J. Monomer-by-Monomer Synthesis of YGGFL and PGGFL 

A synthesis using a 50 μm checkerboard mask similar to 
that shown in FIG. 15 was conducted. However, P was added 
to the GGFL sites on the substrate through an additional 
coupling step. P was added by exposing protected GGFL to 
light through a mask, and subsequent exposure to P in the 
manner set forth above. Therefore, half of the regions on the 
substrate contained YGGFL and the remaining half contained 
PGGFL. 

The fluorescence plot for this experiment is provided in 
FIG. 16. As shown, the regions are again readily discernable. 
This experiment demonstrates that antibodies are able to 
recognize a specific sequence and that the recognition is not 
length-dependent. 

K. Monomer-by-Monomer Synthesis of YGGFL and YPGGFL 

In order to further demonstrate the operability of the 
invention, a 50 μm checkerboard pattern of alternating 
YGGFL and YPGGFL was synthesized on a substrate using 
techniques like those set forth above. The resulting fluorescence 
plot is provided in FIG. 17. Again, it is seen that the 
antibody is clearly able to recognize the YGGFL sequence 
and does not bind significantly at the YPGGFL regions. 

L. Synthesis of an Array of Sixteen Different Amino Acid 
Sequences and Estimation of Relative Binding Affinity to 
Herz Antibody 

Using techniques similar to those set forth above, an array 
of 16 different amino acid sequences (replicated four times) 
was synthesized on each of two glass substrates. The 
sequences were synthesized by attaching the sequence 
NVOC-GFL across the entire surface of the slides. Using a 
series of masks, two layers of amino acids were then 
selectively applied to the substrate. Each region had dimensions 
of 0.25 cm × 0.0625 cm. The first slide contained amino 
acid sequences containing only L amino acids while the 
second slide contained selected D amino acids. FIGS. 18A 
and 18B illustrate a map of the various regions on the first 
and second slides, respectively. The patterns shown in FIGS. 
18A and 18B were duplicated four times on each slide. The 
slides were then exposed to the Herz antibody and 
fluorescein-labeled goat anti-mouse. 

FIG. 19 is a fluorescence plot of the first slide, which 
contained only L amino acids. Red indicates strong binding 
(149,000 counts or more) while black indicates little or no 
binding of the Herz antibody (20,000 counts or less). The 
bottom right-hand portion of the slide appears "cut off" 
because the slide was broken during processing. The 
sequence YGGFL is clearly most strongly recognized. The 
sequences YAGFL and YSGFL also exhibit strong recogni- 
tion of the antibody. By contrast, most of the remaining 
sequences show little or no binding. The four duplicate 
portions of the slide are extremely consistent in the amount 
of binding shown therein. 

FIG. 20 is a fluorescence plot of the second slide. Again, 
strongest binding is exhibited by the YGGFL sequence. 
Significant binding is also detected to YaGFL, YsGFL, and 
YpGFL. The remaining sequences show less binding with 
the antibody. Note the low binding efficiency of the 
sequence yGGFL. 

Table 6 lists the various sequences tested in order of 
relative fluorescence, which provides information regarding 
relative binding affinity. 

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of 
the human genome was generated by the whole-genome shotgun sequencing 
method. The 14.8-billion bp DNA sequence was generated over 9 months from 
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) 
from both ends of plasmid clones made from the DNA of five individuals. Two 
assembly strategies, a whole-genome assembly and a regional chromosome 
assembly, were used, each combining sequence data from Celera and the 
publicly funded genome effort. The public data were shredded into 550-bp 
segments to create a 2.9-fold coverage of those genome regions that had been 
sequenced, without including biases inherent in the cloning and assembly 
procedure used by the publicly funded group. This brought the effective 
coverage in the assemblies to eightfold, reducing the number and size of gaps in 
the final assembly over what would be obtained with 5.11-fold coverage. The 
two assembly strategies yielded very similar results that largely agree with 
independent mapping data. The assemblies effectively cover the euchromatic 
regions of the human chromosomes. More than 90% of the genome is in 
scaffold assemblies of 100,000 bp or more, and 25% of the genome is in 
scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 
26,588 protein-encoding transcripts for which there was strong corroborating 
evidence and an additional ~12,000 computationally derived genes with mouse 
matches or other weak supporting evidence. Although gene-dense clusters are 
obvious, almost half the genes are dispersed in low G+C sequence separated 
by large tracts of apparently noncoding sequence. Only 1.1% of the genome 
is spanned by exons, whereas 24% is in introns, with 75% of the genome being 
intergenic DNA. Duplications of segmental blocks, ranging in size up to 
chromosomal lengths, are abundant throughout the genome and reveal a complex 
evolutionary history. Comparative genomic analysis indicates vertebrate 
expansions of genes associated with neuronal function, with tissue-specific 
developmental regulation, and with the hemostasis and immune systems. DNA 
sequence comparisons between the consensus sequence and publicly funded 
genome data provided locations of 2.1 million single-nucleotide polymorphisms 
(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 
1250 on average, but there was marked heterogeneity in the level of 
polymorphism across the genome. Less than 1% of all SNPs resulted in variation in 
proteins, but the task of determining which SNPs have functional consequences 
remains an open challenge. 
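
The coverage figures quoted in the abstract can be related to one another with a short calculation. The sketch below is illustrative only. Fold coverage is taken as total sequenced bases divided by genome size, and the uncovered-fraction estimate uses the idealized Lander-Waterman model (uniformly random reads, no repeats), which is an assumption introduced here and not the method used by the authors. Function and variable names are hypothetical.

```python
import math


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized Lander-Waterman estimate: chance a base is hit by no read, e^-c."""
    return math.exp(-coverage)


reads = 27_271_853   # high-quality sequence reads
total_bp = 14.8e9    # bases of DNA sequence generated
genome_bp = 2.91e9   # consensus sequence length

mean_read_length = total_bp / reads
coverage = fold_coverage(total_bp, genome_bp)

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"fold coverage:    {coverage:.2f}")
for c in (5.11, 8.0):
    print(f"uncovered fraction at {c}-fold: {expected_uncovered_fraction(c):.5f}")
```

With these rounded inputs the computed coverage comes out close to, but not exactly, the 5.11-fold figure quoted; the small difference presumably reflects rounding of the published totals. Under the idealized model, moving from roughly fivefold to eightfold coverage lowers the expected uncovered fraction by more than an order of magnitude, which is consistent in direction with the reduction in gaps described above.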



Decoding of the DNA that constitutes the 
human genome has been widely anticipated 
for the contribution it will make toward 
understanding human evolution, the causation 
of disease, and the interplay between the 
environment and heredity in defining the 
human condition. A project with the goal of 
determining the complete nucleotide 
sequence of the human genome was first 
formally proposed in 1985 (1). In subsequent 
years, the idea met with mixed reactions in 
the scientific community (2). However, in 
1990, the Human Genome Project (HGP) was 
officially initiated in the United States under 
the direction of the National Institutes of 
Health and the U.S. Department of Energy 
with a 15-year, $3 billion plan for completing 
the genome sequence. In 1998 we announced 
our intention to build a unique genome- 
sequencing facility, to determine the se- 
quence of the human genome over a 3-year 
period. Here we report the penultimate mile- 
stone along the path toward that goal, a nearly 
complete sequence of the euchromatic por- 
tion of the human genome. The sequencing 
was performed by a whole-genome random 
shotgun method with subsequent assembly of 
the sequenced segments. 

The modern history of DNA sequencing 
began in 1977, when Sanger reported his method 
for determining the order of nucleotides of 
DNA using chain-terminating nucleotide analogs 
(3). In the same year, the first human gene 
was isolated and sequenced (4). In 1986, Hood 
and co-workers (5) described an improvement 
in the Sanger sequencing method that included 
attaching fluorescent dyes to the nucleotides, 
which permitted them to be sequentially read 
by a computer. The first automated DNA se- 
quencer, developed by Applied Biosystems in 
California in 1987, was shown to be successful 
when the sequences of two genes were obtained 
with this new technology (6). From early 
sequencing of human genomic regions (7), it 
became clear that cDNA sequences (which are 
reverse-transcribed from RNA) would be 
essential to annotate and validate gene predictions 
in the human genome. These studies were the 
basis in part for the development of the 
expressed sequence tag (EST) method of gene 
identification (8), which is a random selection, 
very high throughput sequencing approach to 
characterize cDNA libraries. The EST method 
led to the rapid discovery and mapping of hu- 
man genes (9). The increasing numbers of hu- 
man EST sequences necessitated the develop- 
ment of new computer algorithms to analyze 
large amounts of sequence data, and in 1993 at 
The Institute for Genomic Research (TIGR), an 
algorithm was developed that permitted assem- 
bly and analysis of hundreds of thousands of 
ESTs. This algorithm permitted characteriza- 
tion and annotation of human genes on the basis 
of 30,000 EST assemblies (10). 

The complete 49-kbp bacteriophage lambda 
genome sequence was determined by a 
shotgun restriction digest method in 1982 
(11). When considering methods for sequencing 
the smallpox virus genome in 1991 (12), 
a whole-genome shotgun sequencing method 
was discussed and subsequently rejected ow- 
ing to the lack of appropriate software tools 
for genome assembly. However, in 1994, 
when a microbial genome-sequencing project 
was contemplated at TIGR, a whole-genome 
shotgun sequencing approach was considered 
possible with the TIGR EST assembly algo- 
rithm. In 1995, the 1.8-Mbp Haemophilus 
influenzae genome was completed by a 
whole-genome shotgun sequencing method 
(13). The experience with several subsequent 
genome-sequencing efforts established the 
broad applicability of this approach (14, 15). 

A key feature of the sequencing approach 
used for these megabase-size and larger ge- 
nomes was the use of paired-end sequences 
(also called mate pairs), derived from sub- 
clone libraries with distinct insert sizes and 
cloning characteristics. Paired-end sequences 
are sequences 500 to 600 bp in length from 
both ends of double-stranded DNA clones of 
prescribed lengths. The success of using end 
sequences from long segments (18 to 20 kbp) 
of DNA cloned into bacteriophage lambda in 
assembly of the microbial genomes led to the 
suggestion (16) of an approach to simultaneously 
map and sequence the human genome 
by means of end sequences from 150-kbp 
bacterial artificial chromosomes (BACs) 
(17, 18). The end sequences spanned by 
known distances provide long-range continu- 
ity across the genome. A modification of the 
BAC end-sequencing (BES) method was ap- 
plied successfully to complete chromosome 2 
from the Arabidopsis thaliana genome (19). 

In 1997, Weber and Myers (20) proposed 
whole-genome shotgun sequencing of the 
human genome. Their proposal was not well 
received (21). However, by early 1998, as 
less than 5% of the genome had been se- 
quenced, it was clear that the rate of progress 
in human genome sequencing worldwide 
was very slow (22), and the prospects for 
finishing the genome by the 2005 goal were 
uncertain. 

In early 1998, PE Biosystems (now Applied 
Biosystems) developed an automated, high- 
throughput capillary DNA sequencer, subse- 
quently called the ABI PRISM 3700 DNA 
Analyzer. Discussions between PE Biosystems 
and TIGR scientists resulted in a plan to undertake 
the sequencing of the human genome with 
the 3700 DNA Analyzer and the whole-genome 
shotgun sequencing techniques developed at 
TIGR (23). Many of the principles of operation 
of a genome-sequencing facility were established 
in the TIGR facility (24). However, the 
facility envisioned for Celera would have a 
capacity roughly 50 times that of TIGR, and 
thus new developments were required for sam- 
ple preparation and tracking and for whole- 
genome assembly. Some argued that the re- 
quired 150-fold scale-up from the H. influenzae 
genome to the human genome with its complex 
repeat sequences was not feasible (25). The 
Drosophila melanogaster genome was thus 
chosen as a test case for whole-genome assem- 
bly on a large and complex eukaryotic genome. 
In collaboration with Gerald Rubin and the 
Berkeley Drosophila Genome Project, the nucleotide 
sequence of the 120-Mbp euchromatic 
portion of the Drosophila genome was determined 
over a 1-year period (26-28). The Drosophila 
genome-sequencing effort resulted in 
two key findings: (i) that the assembly algo- 
rithms could generate chromosome assemblies 
with highly accurate order and orientation with 
substantially less than 10-fold coverage, and (ii) 
that undertaking multiple interim assemblies in 
place of one comprehensive final assembly was 
not of value. 

These findings, together with the dramatic 
changes in the public genome effort subsequent 
to the formation of Celera (29), led to a modified 
whole-genome shotgun sequencing approach 
to the human genome. We initially proposed 
to do 10-fold sequence coverage of the 
genome over a 3-year period and to make interim 
assembled sequence data available quarterly. 
The modifications included a plan to perform 
random shotgun sequencing to ~5-fold 
coverage and to use the unordered and unoriented 
BAC sequence fragments and subassemblies 
published in GenBank by the publicly 
funded genome effort (30) to accelerate the 
project. We also abandoned the quarterly 
announcements in the absence of interim assemblies 
to report. 

Although this strategy provided a reason- 
able result very early that was consistent with a 
whole-genome shotgun assembly with eight- 
fold coverage, the human genome sequence is 
not as finished as the Drosophila genome was 
with an effective 13-fold coverage. However, it 
became clear that even with this reduced cov- 
erage strategy, Celera could generate an accu- 
rately ordered and oriented scaffold sequence of 
the human genome in less than 1 year. Human 
genome sequencing was initiated 8 September 
1999 and completed 17 June 2000. The first 
assembly was completed 25 June 2000, and the 
assembly reported here was completed 1 Octo- 
ber 2000. Here we describe the whole-genome 
random shotgun sequencing effort applied to 
the human genome. We developed two different 
assembly approaches for assembling the ~3 
billion bp that make up the 23 pairs of chromosomes 
of the Homo sapiens genome. Any GenBank-derived 
data were shredded to remove 
potential bias to the final sequence from chimeric 
clones, foreign DNA contamination, or 
misassembled contigs. Insofar as a correctly 
and accurately assembled genome sequence 
with faithful order and orientation of contigs 
is essential for an accurate analysis of the 
human genetic code, we have devoted a con- 
siderable portion of this manuscript to the 
documentation of the quality of our recon- 
struction of the genome. We also describe our 
preliminary analysis of the human genetic 
code on the basis of computational methods. 
Figure 1 (see fold-out chart associated with 
this issue; files for each chromosome can be 
found in Web fig. 1 on Science Online at 
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) 
provides a graphical overview 
of the genome and the features encoded 
in it. The detailed manual curation and interpretation 
of the genome are just beginning. 

To aid the reader in locating specific an- 
alytical sections, we have divided the paper 
into seven broad sections. A summary of the 
major results appears at the beginning of each 
section. 



1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and 
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome- Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing 
Methods 

Summary. This section discusses the rationale 
and ethical rules governing donor selection to 
ensure ethnic and gender diversity along with 
the methodologies for DNA extraction and library 
construction. The plasmid library construction 
is the first critical step in shotgun 
sequencing. If the DNA libraries are not uniform 
in size, nonchimeric, and do not randomly 
represent the genome, then the subsequent steps 
cannot accurately reconstruct the genome sequence. 
We used automated high-throughput 
DNA sequencing and the computational infrastructure 
to enable efficient tracking of enormous 
amounts of sequence information (27.3 
million sequence reads; 14.9 billion bp of sequence). 
Sequencing and tracking from both 
ends of plasmid clones from 2-, 10-, and 50-kbp 
libraries were essential to the computational 
reconstruction of the genome. Our evidence 
indicates that the accurate pairing rate of end 
sequences was greater than 98%. 

Various policies of the United States and the 
World Medical Association, specifically the 
Declaration of Helsinki, offer recommenda- 
tions for conducting experiments with human 
subjects. We convened an Institutional Re- 
view Board (IRB) (31) that helped us estab- 
lish the protocol for obtaining and using hu- 
man DNA and the informed consent process 
used to enroll research volunteers for the 
DNA-sequencing studies reported here. We 
adopted several steps and procedures to pro- 
tect the privacy rights and confidentiality of 
the research subjects (donors). These includ- 
ed a two-stage consent process, a secure ran- 
dom alphanumeric coding system for speci- 
mens and records, circumscribed contact with 
the subjects by researchers, and options for 
off-site contact of donors. In addition, Celera 
applied for and received a Certificate of Con- 
fidentiality from the Department of Health 
and Human Services. This Certificate autho- 
rized Celera to protect the privacy of the 
individuals who volunteered to be donors as 
provided in Section 301(d) of the Public 
Health Service Act 42 U.S.C. 241(d). 

Celera and the IRB believed that the ini- 
tial version of a completed human genome 
should be a composite derived from multiple 
donors of diverse ethnic backgrounds. Prospective 
donors were asked, on a voluntary 
basis, to self-designate an ethnogeographic 
category (e.g., African-American, Chinese, 
Hispanic, Caucasian, etc.). We enrolled 21 
donors (32). 

Three basic items of information from 
each donor were recorded and linked by con- 
fidential code to the donated sample: age, 
sex, and self-designated ethnogeographic 
group. From females, ~130 ml of whole, 
heparinized blood was collected. From males, 
~130 ml of whole, heparinized blood was 
collected, as well as five specimens of semen 
collected over a 6-week period. Permanent 
lymphoblastoid cell lines were created by 
Epstein-Barr virus immortalization. DNA 
from five subjects was selected for genomic 
DNA sequencing: two males and three females 
(one African-American, one Asian-Chinese, 
one Hispanic-Mexican, and two 
Caucasians) (see Web fig. 2 on Science Online 
at www.sciencemag.org/cgi/content/291/5507/1304/DC1). 
The decision of whose DNA to 
sequence was based on a complex mix of fac- 
tors, including the goal of achieving diversity as 
well as technical issues such as the quality of 
the DNA libraries and availability of immortal- 
ized cell lines. 

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequenc- 
ing process is preparation of high-quality plas- 
mid libraries in a variety of insert sizes so that 
pairs of sequence reads (mates) are obtained, 
one read from both ends of each plasmid insert. 
High-quality libraries have an equal representa- 
tion of all parts of the genome, a small number 
of clones without inserts, and no contamination 
from such sources as the mitochondrial genome 
and Escherichia coli genomic DNA. DNA from
each donor was used to construct plasmid libraries
in one or more of three size classes: 2 kbp, 10
kbp, and 50 kbp (Table 1) (33). 

In designing the DNA-sequencing pro- 
cess, we focused on developing a simple 
system that could be implemented in a robust 
and reproducible manner and monitored ef- 
fectively (Fig. 2) (34). 

Table 1. Celera-generated data input into assembly.

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
typically yields only 500 to 750 bp of sequence 
per reaction. This limitation on read length has 
made monumental gains in throughput a pre- 
requisite for the analysis of large eukaryotic 
genomes. We accomplished this at the Celera 
facility, which occupies about 30,000 square 
feet of laboratory space and produces sequence 
data continuously at a rate of 175,000 total 
reads per day. The DNA-sequencing facility is 
supported by a high-performance computational
facility (36).

The process for DNA sequencing was modular
by design and automated. Intermodule
sample backlogs allowed four principal 
modules to operate independently: (i) li- 
brary transformation, plating, and colony 
picking; (ii) DNA template preparation; 
(iii) dideoxy sequencing reaction set-up 
and purification; and (iv) sequence deter- 
mination with the ABI PRISM 3700 DNA 
Analyzer. Because the inputs and outputs 
of each module have been carefully 
matched and sample backlogs are continu- 
ously managed, sequencing has proceeded 
without a single day's interruption since the 
initiation of the Drosophila project in May 
1999. The ABI 3700 is a fully automated
capillary array sequencer and as such can 
be operated with a minimal amount of 
hands-on time, currently estimated at about 
15 min per day. The capillary system also 
facilitates correct associations of sequenc- 
ing traces with samples through the elimi- 
nation of manual sample loading and lane- 
tracking errors associated with slab gels. 
About 65 production staff were hired and 
trained, and were rotated on a regular basis
through the four production modules. A
central laboratory information management
system (LIMS) tracked all sample plates by 
unique bar code identifiers. The facility was 
supported by a quality control team that per- 
formed raw material and in-process testing 
and a quality assurance group with responsi- 
bilities including document control, valida- 
tion, and auditing of the facility. Critical to 
the success of the scale-up was the validation 
of all software and instrumentation before 
implementation, and production-scale testing 
of any process changes. 

1.2 Trace processing 

An automated trace-processing pipeline has 
been developed to process each sequence file 
(37). After quality and vector trimming, the
average trimmed sequence length was 543
bp, and the sequencing accuracy was exponentially
distributed with a mean of 99.5%
and with less than 1 in 1000 reads being less 
than 98% accurate (26). Each trimmed se- 
quence was screened for matches to contam- 
inants including sequences of vector alone, E. 
coli genomic DNA, and human mitochondri- 
al DNA. The entire read for any sequence 
with a significant match to a contaminant was 
discarded. A total of 713 reads matched E. 
coli genomic DNA and 2114 reads matched 
the human mitochondrial genome. 

1.3 Quality assessment and control 

The importance of the base-pair level accuracy
of the sequence data increases as the
size and repetitive nature of the genome to
be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-pair
information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as sequencing
reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data produced
by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

                                      Number of reads for different insert libraries
                          Individual  2 kbp         10 kbp        50 kbp        Total         Total number of base pairs
No. of sequencing reads   A           0             0             2,767,357     2,767,357     1,502,674,851
                          B           11,736,757    7,467,755     66,930        19,271,442    10,464,393,006
                          C           853,819       881,290       0             1,735,109     942,164,187
                          D           952,523       1,046,815     0             1,999,338     1,085,640,534
                          F           0             1,498,607     0             1,498,607     813,743,601
                          Total       13,543,099    10,894,467    2,834,287     27,271,853    14,808,616,179
Fold sequence coverage    A           0             0             0.52          0.52
(2.9-Gb genome)           B           2.20          1.40          0.01          3.61
                          C           0.16          0.17          0             0.32
                          D           0.18          0.20          0             0.37
                          F           0             0.28          0             0.28
                          Total       2.54          2.04          0.53          5.11
Fold clone coverage       A           0             0             18.39         18.39
                          B           2.96          11.26         0.44          14.67
                          C           0.22          1.33          0             1.54
                          D           0.24          1.58          0             1.82
                          F           0             2.26          0             2.26
                          Total       3.42          16.43         18.84         38.68
Insert size* (mean)       Average     1,951 bp      10,800 bp     50,715 bp
Insert size* (SD)         Average     6.10%         8.10%         14.90%
% Mates†                  Average     74.50         80.80         75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

2 Genome Assembly Strategy and 
Characterization 

Summary. We describe in this section the two
approaches that we used to assemble the genome.
One method involves the computational
combination of all sequence reads with shredded
data from GenBank to generate an independent,
nonbiased view of the genome. The second
approach involves clustering all of the fragments
to a region or chromosome on the basis
of mapping information. The clustered data
were then shredded and subjected to computational
assembly. Both approaches provided essentially
the same reconstruction of assembled
DNA sequence with proper order and orientation.
The second method provided slightly
greater sequence coverage (fewer gaps) and
was the principal sequence used for the analysis
phase. In addition, we document the completeness
and correctness of this assembly process
and provide a comparison to the public genome
sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our
assemblies effectively covered the euchromatic
regions of the human chromosomes. More than
90% of the genome was in scaffold assemblies
of 100,000 bp or greater, and 25% of the genome
was in scaffolds of 10 million bp or
larger.

Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.

Shotgun sequence assembly is a classic 
example of an inverse problem: given a set 
of reads randomly sampled from a target 
sequence, reconstruct the order and the po- 
sition of those reads in the target. Genome 
assembly algorithms developed for Dro- 
sophila have now been extended to assemble 
the ~25-fold larger human genome. Celera assemblies
consist of a set of contigs that are
ordered and oriented into scaffolds that are then 
mapped to chromosomal locations by using 
known markers. The contigs consist of a col- 
lection of overlapping sequence reads that pro- 
vide a consensus reconstruction for a contigu- 
ous interval of the genome. Mate pairs are a 
central component of the assembly strategy. 
They are used to produce scaffolds in which the 
size of gaps between consecutive contigs is 
known with reasonable precision. This is ac- 
complished by observing that a pair of reads, 
one of which is in one contig, and the other of 
which is in another, implies an orientation and 
distance between the two contigs (Fig. 3). Fi- 
nally, our assemblies did not incorporate all 
reads into the final set of reported scaffolds. 
This set of unincorporated reads is termed 
"chaff," and typically consisted of reads from 
within highly repetitive regions, data from other 
organisms introduced through various routes as 
found in many genome projects, and data of 
poor quality or with untrimmed vector.

2.1 Assembly data sets

We used two independent sets of data for our
assemblies. The first was a random shotgun 
data set of 27.27 million reads of average length 
543 bp produced at Celera. This consisted 
largely of mate-pair reads from 16 libraries 
constructed from DNA samples taken from five 
different donors. Libraries with insert sizes of 2, 
10, and 50 kbp were used. By looking at how 
mate pairs from a library were positioned in 
known sequenced stretches of the genome, we 
were able to characterize the range of insert 
sizes in each library and determine a mean and 
standard deviation. Table 1 details the number 
of reads, sequencing coverage, and clone coverage
achieved by the data set. The clone coverage
is the coverage of the genome in cloned
DNA, considering the entire insert of each
clone that has sequence from both ends. The
clone coverage provides a measure of the
amount of physical DNA coverage of the genome.
Assuming a genome size of 2.9 Gbp, the
Celera trimmed sequences gave a 5.1X coverage
of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.
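
The coverage arithmetic above can be sketched in a few lines of Python. This is an illustration only: the formulas are assumptions (sequence coverage as total trimmed bases divided by genome size; clone coverage as mated pairs times mean insert size divided by genome size), the function names are hypothetical, and the inputs are the totals, mate percentages, and mean insert sizes reported in Table 1.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def clone_coverage(reads, mate_fraction, mean_insert_bp, genome_size=GENOME_SIZE_BP):
    """Approximate fold clone coverage (assumed formula).

    Every mated pair of reads is taken to contribute one clone whose whole
    insert (mean_insert_bp) counts toward physical coverage.
    """
    mated_pairs = reads * mate_fraction / 2
    return mated_pairs * mean_insert_bp / genome_size


# Inputs taken from Table 1: (reads, fraction mated, mean insert size in bp).
libraries = {
    "2 kbp": (13_543_099, 0.745, 1_951),
    "10 kbp": (10_894_467, 0.808, 10_800),
    "50 kbp": (2_834_287, 0.756, 50_715),
}

seq_cov = sequence_coverage(14_808_616_179)
clone_cov = {name: clone_coverage(*args) for name, args in libraries.items()}

print(f"sequence coverage: {seq_cov:.2f}X")
for name, cov in clone_cov.items():
    print(f"clone coverage, {name} library: {cov:.2f}X")
```

Under these assumptions the sequence coverage rounds to the reported value, and the clone-coverage estimates land within about 1% of the published figures; the small differences presumably reflect rounding of the tabulated inputs.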

The second data set was from the publicly 
funded Human Genome Project (PFP) and is 
primarily derived from BAC clones (30). The 
BAC data input to the assemblies came from a 
download of GenBank on 1 September 2000 
(Table 2) totaling 4443.3 Mbp of sequence. 
The data for each BAC is deposited at one of 
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered assemblies
of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has focused
on a product of lower quality and completeness,
but on a faster time-course, by concentrating
on the production of Phase 1 data
from a 3X to 4X light-shotgun of each BAC
clone.

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and
internally derived reads from five different individuals (black lines) are combined to produce a
contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using
mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star)
physical map information.

We screened the bactig sequences for con- 
taminants by using the BLAST algorithm 
against three data sets: (i) vector sequences 
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), filtered
at 200 bp at 98%; and (iii) the nonredundant
nucleotide sequences from GenBank
without primate and human virus entries,
filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within 
50 bp of the end of a contig, the tip up to 
the matching vector was excised. Under 
these criteria we removed 2.6 Mbp of pos- 
sible contaminant and vector from the 
Phase 3 data, 61.0 Mbp from the Phase 1
and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2), 
and 5% single sequencing reads (Phase 0). 
An additional 104,018 BAC end-sequence 
mate pairs were also downloaded and in- 
cluded in the data sets for both assembly 
processes (18). 
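
The tip-excision rule described above (25 bp or more of vector within 50 bp of a contig end) can be illustrated with a short Python sketch. It is not the screening code used for the assembly: the function name and the representation of vector hits as half-open intervals are hypothetical, the hits are assumed to come from a separate similarity search, and the exact boundary handling is an assumption.

```python
MIN_VECTOR_MATCH = 25  # bp of vector required, per the text
END_WINDOW = 50        # hit must lie within this many bp of a contig end


def excise_vector_tips(contig, hits):
    """Return the contig with vector-contaminated tips removed.

    hits: list of (start, end) half-open intervals on the contig that
    matched vector sequence (found by a separate search, not shown here).
    """
    left, right = 0, len(contig)
    for start, end in hits:
        if end - start < MIN_VECTOR_MATCH:
            continue  # too short to count as vector
        if start < END_WINDOW:
            left = max(left, end)      # drop the left tip and the vector itself
        elif len(contig) - end < END_WINDOW:
            right = min(right, start)  # drop the vector and the right tip
    return contig[left:right] if left < right else ""


# Toy contig: 10 bp tip, 30 bp "vector", 200 bp insert, 25 bp "vector", 5 bp tip.
toy = "A" * 10 + "V" * 30 + "C" * 200 + "V" * 25 + "G" * 5
trimmed = excise_vector_tips(toy, [(10, 40), (240, 265)])
internal_only = excise_vector_tips("C" * 300, [(100, 130)])
too_short = excise_vector_tips("C" * 300, [(0, 20)])
```

Hits far from either end are left in place by this sketch; in the procedure described in the text, internal matches are handled by the separate contaminant filters.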

2.2 Assembly strategies 

Two different approaches to assembly were 
pursued. The first was a whole-genome as- 
sembly process that used Celera data and the 
PFP data in the form of additional synthetic 
shotgun data, and the second was a compart- 
mentalized assembly process that first parti- 
tioned the Celera and PFP data into sets 
localized to large chromosomal segments and 
then performed ab initio shotgun assembly on 
each set. Figure 4 gives a schematic of the 
overall process flow. 

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 2.2% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional sequence
coverage, but not mate pairs, assembled
bactigs, or genome locality, from some externally
generated data.

Table 2. GenBank data input into assembly.

*Other centers contributing at least 0.1% of the sequence include: Chinese National Human Genome Center;
Genomanalyse Gesellschaft fuer Biotechnologische Forschung mbH; Genome Therapeutics Corporation; GENOSCOPE;
Chinese Academy of Sciences; Institute of Molecular Biotechnology; Keio University School of Medicine; Lawrence
Livermore National Laboratory; Cold Spring Harbor Laboratory; Los Alamos National Laboratory; Max-Planck Institut fuer
Molekulare Genetik; Japan Science and Technology Corporation; Stanford University; The Institute for Genomic
Research; The Institute of Physical and Chemical Research, Gene Bank; The University of Oklahoma; University of Texas
Southwestern Medical Center; University of Washington. †The 4,405,700,825 bases contributed by all centers were
shredded into faux reads resulting in 2.96X coverage of the genome.
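
The shredding of bactigs into 550-bp faux reads at 2X can be sketched as follows. This is a minimal illustration, assuming a fixed tiling in which reads start every half read length; the function name and the handling of the sequence ends are hypothetical and are not taken from the assembly software.

```python
READ_LEN = 550  # faux read length given in the text


def shred(bactig, read_len=READ_LEN, depth=2):
    """Cut a bactig into overlapping faux reads of roughly `depth`-fold coverage.

    Reads start every read_len // depth bases; a final read is anchored at
    the end of the sequence so that the last bases are not lost.
    """
    if len(bactig) <= read_len:
        return [bactig]
    step = read_len // depth
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [bactig[s:s + read_len] for s in starts]


# Toy bactig of 2,750 bp: nine reads, interior bases covered exactly twice.
toy_bactig = "ACGTA" * 550
faux_reads = shred(toy_bactig)

depth_per_base = [0] * len(toy_bactig)
for i in range(len(faux_reads)):
    start = i * (READ_LEN // 2)
    for pos in range(start, start + READ_LEN):
        depth_per_base[pos] += 1
```

With this tiling only the first and last half-read of each bactig are covered once; every other base is covered by exactly two faux reads, and no mate-pair or ordering information from the source assembly is carried over.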

In the compartmentalized shotgun assembly 
(CSA), Celera and PFP data were partitioned 
into the largest possible chromosomal segments
or "components" that could be determined with
confidence, and then shotgun assembly was applied
to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of 
the component. By subsetting the data in this 
way, the overall computational effort was re- 
duced and the effect of interchromosomal dupli- 
cations was ameliorated. This also resulted in a 
reconstruction of the genome that was relatively
independent of the whole-genome assembly results
so that the two assemblies could be compared
for consistency. The quality of the partitioning
into components was crucial so that
different genome regions were not mixed together.
We constructed components from (i) the
longest scaffolds of the sequence from each 
BAC and (ii) assembled scaffolds of data unique 
to Celera's data set. The BAC assemblies were 
obtained by a combining assembler that used the 
bactigs and the 5X Celera data mapped to those
bactigs as input. This effort was undertaken as
an interim step solely because the more accurate
and complete the scaffold for a given sequence
stretch, the more accurately one can tile these 
scaffolds into contiguous components on the 
basis of sequence overlap and mate-pair infor- 
mation. We further visually inspected and cu- 
rated the scaffold tiling of the components to 
further increase its accuracy. For the final CSA 
assembly, all but the partitioning was ignored 
and an independent, ab initio reconstruction of 
the sequence in each component was obtained 
by applying our whole-genome assembly algo- 
rithm to the partitioned, relevant Celera data and 
the shredded, faux reads of the partitioned, rel- 
evant bactig data. 

2.3 Whole-genome assembly 

The algorithms used for whole-genome as- 
sembly (WGA) of the human genome were 
enhancements to those used to produce the 
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements, including
Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.

The Overlapper compares every read
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously vector-trimmed,
the Overlapper can insist on complete
overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
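
The overlap criterion (at least 40 bp, no more than 6% differences) can be illustrated with a deliberately naive Python sketch. It is not the Overlapper: it compares a single pair of reads by brute force, counts substitutions only (a real overlapper must also tolerate insertions and deletions and must avoid all-against-all brute force), and the function and variable names are hypothetical.

```python
import random

MIN_OVERLAP = 40      # bp, from the text
MAX_DIFF_RATE = 0.06  # at most 6% differences, from the text


def suffix_prefix_overlap(a, b, min_len=MIN_OVERLAP, max_diff=MAX_DIFF_RATE):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Simplification: differences are counted as substitutions only
    (Hamming distance). Returns 0 if no qualifying overlap exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        diffs = sum(x != y for x, y in zip(suffix, prefix))
        if diffs <= max_diff * length:
            return length
    return 0


# Toy example: two reads sharing a 50-bp region that differs at 2 positions (4%).
rng = random.Random(0)


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


core = random_dna(50)
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
mutated = list(core)
for i in (10, 30):
    mutated[i] = swap[mutated[i]]

read_a = random_dna(30) + core
read_b = "".join(mutated) + random_dna(30)
overlap_len = suffix_prefix_overlap(read_a, read_b)
no_overlap = suffix_prefix_overlap("A" * 50, "C" * 50)
```

The quadratic cost of comparing every read with every other read in this way is what makes the CPU budget quoted above necessary at the scale of tens of millions of reads.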

Every overlap computed above is statistically
a 1-in-10^17 event and thus not a coincidental
event. What makes assembly combinatorially
difficult is that while many overlaps
are actually sampled from overlapping
regions of the genome, and thus imply that
the sequence reads should be assembled together,
even more overlaps are actually from
two distinct copies of a low-copy repeated
element not screened above, thus constituting
an error if put together. We call the former
"true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choosing
repeat-induced overlaps, especially early
in the process.

We achieve this objective in the Unitigger.
We first find all assemblies of reads that
appear to be uncontested with respect to all
other reads. We call the contigs formed from
these subassemblies unitigs (for uniquely assembled
contigs). Formally, these unitigs are
the uncontested interval subgraphs of the
graph of all overlaps (42). Unfortunately, although
empirically many of these assemblies
are correct (and thus involve only true over- 
laps), some are in fact collections of reads 
from several copies of a repetitive element 
that have been overcollapsed into a single 
subassembly. However, the overcollapsed 
unitigs are easily identified because their av- 
erage coverage depth is too high to be con- 
sistent with the overall level of sequence 
coverage. We developed a simple statistical 
discriminator that gives the logarithm of the 
odds ratio that a unitig is composed of unique 
DNA or of a repeat consisting of two or more 
copies. The discriminator, set to a sufficiently 
stringent threshold, identifies a subset of the 
unitigs that we are certain are correct. In
addition, a second, less stringent threshold 
identifies a subset of remaining unitigs very 
likely to be correctly assembled, of which we 
select those that will consistently scaffold 
(see below), and thus are again almost certain 
to be correct. We call the union of these two 
sets U-unitigs. Empirically, we found from a 
6x simulated shotgun of human chromosome 
22 that we get U-unitigs covering 98% of the 
stretches of unique DNA that are >2 kbp 
long. We are further able to identify the 
boundary of the start of a repetitive element 
at the ends of a U-unitig and leverage this so 
that U-unitigs span more than 93% of all
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
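
A log-odds discriminator of the kind described above can be sketched under a simple Poisson model of read sampling. The model and the function below are illustrative assumptions, not necessarily the exact statistic used in the assembler. If reads start uniformly at random, a unique unitig of length L is expected to contain lambda = L * N / G reads (N total reads, genome size G), whereas two copies of a repeat collapsed into one unitig are expected to contain 2 * lambda. The log of the ratio of the two Poisson probabilities of observing k reads simplifies to lambda - k * ln 2.

```python
import math


def unique_log_odds(unitig_len_bp, reads_in_unitig, total_reads, genome_size_bp):
    """ln[P(k reads | unique DNA) / P(k reads | two-copy repeat)], Poisson model."""
    expected = unitig_len_bp * total_reads / genome_size_bp  # lambda for unique DNA
    return expected - reads_in_unitig * math.log(2)


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability mass function, used only as a cross-check."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


# Toy example with the read count and genome size quoted in the text:
# a 10-kbp unitig is expected to hold about 94 reads if it is unique.
N_READS, GENOME = 27_271_853, 2.9e9
lam = 10_000 * N_READS / GENOME

score_unique = unique_log_odds(10_000, 94, N_READS, GENOME)     # typical depth
score_collapsed = unique_log_odds(10_000, 188, N_READS, GENOME)  # double depth
direct = poisson_log_pmf(94, lam) - poisson_log_pmf(94, 2 * lam)
```

A strongly positive score favors unique DNA and a strongly negative one favors an overcollapsed repeat; the two thresholds mentioned in the text would correspond to two cutoffs on such a score.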

The result of running the Unitigger was 
thus a set of correctly assembled subcontigs 
covering an estimated 73.6% of the human 
genome. The Scaffolder then proceeded to 
use mate-pair information to link these to- 
gether into scaffolds. When there are two or 
more mate pairs that imply that a given pair 
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one 
can with high confidence link together all 
U-unitigs that are linked by at least two 2- or 
10-kbp mate pairs producing intermediate- 
sized scaffolds that are then recursively 
linked together by confirming 50-kbp mate
pairs and BAC end sequences. This process 
yielded scaffolds that are on the order of 
megabase pairs in size with gaps between 
their contigs that generally correspond to re- 
petitive elements and occasionally to small 
sequencing gaps. These scaffolds reconstruct 
the majority of the unique sequence within a
genome. 
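
How mate pairs constrain the gap between two contigs can be illustrated with a short Python sketch. It assumes a simple geometry (each insert spans from the outer end of one read, through the gap, to the outer end of its mate) and combines several mate pairs by inverse-variance weighting; both the weighting scheme and the function names are assumptions made for illustration, not a description of the Scaffolder.

```python
import math


def gap_from_mate(insert_mean, dist_to_end_a, dist_from_start_b):
    """Gap implied by one mate pair spanning contig A and the following contig B.

    dist_to_end_a: bases from the outer end of the read in A to A's right end.
    dist_from_start_b: bases from B's left end to the outer end of the read in B.
    """
    return insert_mean - dist_to_end_a - dist_from_start_b


def combine_gap_estimates(mates):
    """Inverse-variance weighted mean gap and its standard deviation.

    mates: list of (insert_mean, insert_sd, dist_to_end_a, dist_from_start_b).
    """
    weights = [1.0 / sd ** 2 for _, sd, _, _ in mates]
    gaps = [gap_from_mate(mean, a, b) for mean, _, a, b in mates]
    total = sum(weights)
    gap_mean = sum(w * g for w, g in zip(weights, gaps)) / total
    return gap_mean, math.sqrt(1.0 / total)


# Toy example: one 2-kbp mate pair (implied gap 500 bp) and one 10-kbp mate
# pair (implied gap 600 bp); the tighter library dominates the estimate.
toy_mates = [
    (2_000, 100, 900, 600),
    (10_000, 800, 5_000, 4_400),
]
gap_mean, gap_sd = combine_gap_estimates(toy_mates)
```

Because the standard deviation of the insert size grows with library size, the short-insert libraries pin down small gaps precisely while the long-insert libraries provide the long-range linking.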

For the Drosophila assembly, we engaged 
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we contin- 
ued to use the first "Rocks" substage where 
all unitigs with a good, but not definitive, 
discriminator score are placed in a scaffold 
gap. This was done with the condition that 
two or more mate pairs with one of their 
reads already in the scaffold unambiguously 
place the unitig in the given gap. We estimate 
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.
We revised the ensuing "Stones" substage
of the human assembly, making it more like 
the mechanism suggested in our earlier work 
(43). For each gap, every read R that is placed 
in the gap by virtue of its mated pair M being 
in a contig of the scaffold and implying R's 
placement is collected. Celera's mate-pairing 
information is correct more than 99% of the 
time. Thus, almost every, but not all, of the 
reads in the set belong in the gap, and when 
a read does not belong it rarely agrees with 
the remainder of the reads. Therefore, we 
simply assemble this set of reads within the 
gap, eliminating any reads that conflict with 
the assembly. This operation proved much 
more reliable than the one it replaced for the 
Drosophila assembly; in the assembly of a 
simulated shotgun data set of human chromosome
22, all stones were placed correctly.

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each oval denotes a computation
process performing the function indicated by its label, with the labels on arcs between ovals
describing the nature of the objects produced and/or consumed by a process. This figure
summarizes the discussion in the text that defines the terms and phrases used.

The final method of resolving gaps is to 
fill them with assembled BAC data that cover 
the gap. We call this external gap "walking." 
We did not include the very aggressive "Peb- 
bles" substage described in our Drosophila 
work, which made enough mistakes so as to 
produce repeat reconstructions for long inter- 
spersed elements whose quality was only 
99.62% correct. We decided that for the human
genome it was philosophically better not
to introduce a step that was certain to produce 
less than 99.99% accuracy. The cost was a 
somewhat larger number of gaps of some- 
what larger size. 

At the final stage of the assembly process, 
and also at several intermediate points, a 
consensus sequence of every contig is pro- 
duced. Our algorithm is driven by the princi- 
ple of maximum parsimony, with quality- 
value-weighted measures for evaluating each 
base. The net effect is a Bayesian estimate of 
the correct base to report at each position. 
Consensus generation uses Celera data when- 
ever it is present. In the event that no Celera 
data cover a given region, the BAC data 
sequence is used. 
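
A quality-weighted base call for a single alignment column can be sketched as a small Bayesian calculation. This is an illustration under stated assumptions, not the consensus algorithm itself: reads are treated as independent, the prior over the four bases is uniform, an erroneous call is assumed equally likely to be any of the other three bases, and quality values are interpreted on the Phred scale. The function name is hypothetical.

```python
import math

BASES = "ACGT"


def call_consensus(column):
    """Most probable base for one alignment column, with its posterior.

    column: list of (base, phred_quality) pairs, one per read covering
    the position.
    """
    log_like = {}
    for candidate in BASES:
        total = 0.0
        for base, qual in column:
            # Phred definition of error probability; capped at 0.75, where a
            # call carries no information under the uniform-error assumption.
            p_err = min(10 ** (-qual / 10), 0.75)
            if base == candidate:
                total += math.log(1 - p_err)
            else:
                total += math.log(p_err / 3)
        log_like[candidate] = total
    peak = max(log_like.values())
    weights = {b: math.exp(v - peak) for b, v in log_like.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Agreement between good reads gives a confident call ...
majority_call = call_consensus([("A", 30), ("A", 20), ("C", 10)])
# ... but one high-quality read can outweigh two low-quality reads.
quality_call = call_consensus([("A", 10), ("A", 10), ("C", 40)])
```

The second example shows why quality weighting matters: a simple majority vote would report A, whereas the weighted estimate reports C.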

A key element of achieving a WGA of the 
human genome was to parallelize the Overlap- 
per and the central consensus sequence-con- 
structing subroutines. In addition, memory was 
a real issue: a straightforward application of
the software we had built for Drosophila would
have required a computer with a 600-gigabyte
RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same
computation with a maximum of instantaneous 
usage of 28 gigabytes of RAM. Moreover, the 
incremental nature of the first three stages al- 
lowed us to continually update the state of this 
part of the computation as data were delivered 
and then perform a 7-day run to complete Scaf- 
folding and Repeat Resolution whenever de- 
sired. For our assembly operations, the total 
compute infrastructure consists of 10 four-pro- 
cessor SMPs with 4 gigabytes of memory per 
cluster (Compaq's ES40, Regatta) and a 16-processor
NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The 
total compute for a run of the assembler was 
roughly 20,000 CPU hours. 

The assembly of Celera's data, together 
with the shredded bactig data, produced a set of 
scaffolds totaling 2.848 Gbp in span and con- 
sisting of 2.586 Gbp of sequence. The chaff, or 
set of reads not incorporated in the assembly, 
numbered 11.27 million (26%), which is con- 
sistent with our experience for Drosophila. 
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the distribution
of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long. Similarly,
more than 65% of the sequence is in contigs
>30 kbp, more than 31% is in contigs >100
kbp, and the largest contig was 1.22 Mbp long.
Table 3 gives detailed summary statistics for
the structure of this assembly with a direct
comparison to the compartmentalized shotgun
assembly.

Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Compartmentalized shotgun assembly

2,905,568,203    2,748,892,430    2,700,489,906
2,653,979,733    2,524,251,302    2,491,538,372
53,591           2,845            1,935
170,033          112,207          107,199
116,442          109,362          105,264
72,091           69,175           67,289
54,217           966,219          1,395,602
15,609           22,496           23,242
2,161                             1,985
1,988,321                         1,988,321
100                               94

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pur- 
sued a localized assembly approach that was 
intended to subdivide the genome into seg- 
ments, each of which could be shotgun as- 
sembled individually. We expected that this 
would help in resolution of large interchromosomal
duplications and improve the statistics
for calculating U-unitigs. The compartmentalized
assembly process involved clustering
Celera reads and bactigs into large,
multiple megabase regions of the genome, 
and then running the WGA assembler on the 
Celera data and shredded, faux reads ob- 
tained from the bactig data. 

The first phase of the CSA strategy was to 
separate Celera reads into those that matched 
the BAC contigs for a particular PFP BAC 
entry, and those that did not match any public 
data. Such matches must be guaranteed to
properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at 
least 40 bp to unmasked portions of the read 
constituted a hit. Of Celera's 27.27 million 
reads, 20.76 million matched a bactig and 
another 0.62 million reads, which did not 
have any matches, were nonetheless identi- 
fied as belonging in the region of the bactig's 
BAC because their mate matched the bactig. 
Of the remaining reads, 2.92 million were 
completely screened out and so could not be 
matched, but the other 2.97 million reads had 
unmasked sequence totaling 1.189 Gbp that 
were not found in the GenBank data set. 
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera 
sequence is not in the GenBank data set. 

In the next step of the CSA process, a 
combining assembler took the relevant 5x 
Celera reads and bactigs for a BAC entry, and 
produced an assembly of the combined data 
for that locale. These high-quality sequence 
reconstructions were a transient result whose 
utility was simply to provide more reliable 
information for the purposes of their tiling 
into sets of overlapping and adjacent scaffold 
sequences in the next step. In outline, the 
combining assembler first examines the set of 
matching Celera reads to determine if there 
are excessive pileups indicative of un- 
screened repetitive elements. Wherever these 
occur, reads in the repeat region whose mates 
have not been mapped to consistent positions 
are removed. Then all sets of mate pairs that 
consistently imply the same relative position 
of two bactigs are bundled into a link and 
weighted according to the number of mates in 
the bundle. A "greedy" strategy then attempts 
to order the bactigs by selecting bundles of 
mate-pairs in order of their weight. A selected 
mate-pair bundle can tie together two forma- 
tive scaffolds. It is incorporated to form a 
single scaffold only if it is consistent with the 
majority of links between contigs of the scaf- 
fold. Once scaffolding is complete, gaps are 
filled by the "Stones" strategy described 
above for the WGA assembler. 
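
The greedy ordering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the combining assembler itself: each bundle is assumed to carry a single implied offset between two bactigs, orientation is ignored, and the function name, the tuple layout, and the `tolerance` parameter are hypothetical.

```python
def greedy_scaffold(bactigs, bundles, tolerance=500):
    """Order bactigs greedily from weighted mate-pair bundles.

    bundles: list of (a, b, offset, weight), meaning the bundle places
    the start of b `offset` bp after the start of a and is supported by
    `weight` mate pairs. Orientation is ignored in this sketch.
    Returns a list of scaffolds, each a dict {bactig: position}.
    """
    scaffold_of = {b: i for i, b in enumerate(bactigs)}
    scaffolds = {i: {b: 0} for i, b in enumerate(bactigs)}

    # Consider bundles from the heaviest to the lightest.
    for a, b, offset, weight in sorted(bundles, key=lambda t: -t[3]):
        sa, sb = scaffold_of[a], scaffold_of[b]
        if sa == sb:
            continue  # already in the same formative scaffold
        # Shift that would move scaffold sb into the coordinates of sa.
        shift = scaffolds[sa][a] + offset - scaffolds[sb][b]
        # Count links between the two scaffolds that agree with this shift.
        agree = disagree = 0
        for x, y, off, _ in bundles:
            if scaffold_of[x] == sb and scaffold_of[y] == sa:
                x, y, off = y, x, -off  # view the link from sa's side
            if scaffold_of[x] == sa and scaffold_of[y] == sb:
                implied = scaffolds[sa][x] + off - scaffolds[sb][y]
                if abs(implied - shift) <= tolerance:
                    agree += 1
                else:
                    disagree += 1
        # Merge only if the majority of links is consistent.
        if agree > disagree:
            for name, pos in scaffolds.pop(sb).items():
                scaffolds[sa][name] = pos + shift
                scaffold_of[name] = sa
    return list(scaffolds.values())


consistent = [("A", "B", 1000, 5), ("B", "C", 2000, 3), ("A", "C", 3000, 2)]
print(greedy_scaffold(["A", "B", "C"], consistent))
```

In the example, all three bundles agree, so the three bactigs end up in one scaffold. When the links between two formative scaffolds contradict each other without a majority, the sketch leaves the scaffolds separate.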

The GenBank data for the Phase 1 and 2 
BACs consisted of an average of 19.8 bactigs 
per BAC of average size 8099 bp. Applica- 
tion of the combining assembler resulted in 
individual Celera BAC assemblies being put 
together into an average of 1.83 scaffolds 
(median of 1 scaffold) consisting of an aver- 
age of 8.57 contigs of average size 18,973 bp. 
In addition to defining order and orientation 
of the sequence fragments, there were 57% 
fewer gaps in the combined result. For Phase
0 data, the average GenBank entry consisted 
of 91.52 reads of average length 784 bp. 
Application of the combining assembler re- 
sulted in an average of 54.8 scaffolds consist- 
ing of an average of 58.1 contigs of average 
size 873 bp. Basically, some small amount of 
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler 
was also applied to the Phase 3 BACs for 
SNP identification, confirmation of assembly,
and localization of the Celera reads. The
Phase 0 data suggest that a combined whole-genome
shotgun data set and 1X light-shotgun
of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of 
each BAC is needed. 

The 5.89 million Celera fragments not
matching the GenBank data were assembled 
with our whole-genome assembler. The as- 
sembly resulted in a set of scaffolds totaling 
442 Mbp in span and consisting of 326 Mbp 
of sequence. More than 20% of the scaffolds 
were >5 kbp long, and these averaged 63% 
sequence and 27% gaps with a total of 302 
Mbp of sequence. All scaffolds >5 kbp were 
forwarded along with all scaffolds produced 
by the combining assembler to the subse- 
quent tiling phase. 

At this stage, we typically had one or two 
scaffolds for every BAC region constituting 
at least 95% of the relevant sequence, and a
collection of disjoint Celera-unique scaffolds. 
The next step in developing the genome com- 
ponents was to determine the order and over- 
lap tiling of these BAC and Celera-unique 
scaffolds across the genome. For this, we 
used Celera's 50-kbp mate-pair information,
and BAC-end pairs (18) and sequence tagged 
site (STS) markers (44) to provide long- 
range guidance and chromosome separation. 
Given the relatively manageable number of 
scaffolds, we chose not to produce this tiling 
in a fully automated manner, but to compute 
an initial tiling with a good heuristic and then 
use human curators to resolve discrepancies 
or missed join opportunities. To this end, we 
developed a graphical user interface that dis- 
played the graph of tiling overlaps and the 
evidence for each. A human curator could 
then explore the implication of mapped STS 
data, dot-plots of sequence overlap, and a 
visual display of the mate-pair evidence sup- 
porting a given choice. The result of this 
process was a collection of "components," 
where each component was a tiled set of 
BAC and Celera-unique scaffolds that had 
been curator-approved. The process resulted 
in 3845 components with an estimated span 
of 2.922 Gbp. 

In order to generate the final CSA, we 
assembled each component with the WGA 
algorithm. As was done in the WGA process, 
the bactig data were shredded into a synthetic 
2X shotgun data set in order to give the 
assembler the freedom to independently as- 
semble the data. By using faux reads rather 
than bactigs, the assembly algorithm could 
correct errors in the assembly of bactigs and 
remove chimeric content in a PFP data entry. 



Chimeric or contaminating sequence (from
another part of the genome) would not be
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
effect, the previous steps in the CSA process
served only to bring together Celera frag- 
ments and PFP data relevant to a large con- 
tiguous segment of the genome, wherein we 
applied the assembler used for WGA to pro- 
duce an ab initio assembly of the region. 
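
Shredding a bactig into faux reads can be illustrated with a short sketch. The read length used here is an assumption chosen for the example (the text only states the 2X target), and the function name is hypothetical.

```python
def shred(sequence, read_length=550, coverage=2):
    """Cut a sequence into overlapping faux reads at roughly `coverage`X.

    read_length is an assumed value for illustration only.
    """
    if len(sequence) <= read_length:
        return [sequence]
    # Spacing between read starts that yields about `coverage`X depth.
    step = read_length // coverage
    starts = list(range(0, len(sequence) - read_length, step))
    # Add a final read so the end of the sequence is always covered.
    starts.append(len(sequence) - read_length)
    return [sequence[s:s + read_length] for s in starts]


bactig = "ACGT" * 1000  # a 4000-bp toy sequence
reads = shred(bactig, read_length=500, coverage=2)
depth = sum(len(r) for r in reads) / len(bactig)
print(len(reads), "faux reads, average depth", round(depth, 2))
```

Because the assembler sees only the faux reads, it is free to join them differently from the original bactig, which is the point made in the paragraph above.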

WGA assembly of the components resulted
in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of sequence.
The chaff, or set of reads not incorporated
into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the 
genome was covered by scaffolds spanning 
>100 kbp long, and these averaged 92.2% 
sequence and 7.8% gaps with a total of 2.492 
Gbp of sequence. There were a total of 
105,264 gaps among the 107,199 contigs that 
belong to the 1940 scaffolds spanning >100 
kbp. The average scaffold size was 1.4 Mbp,
the average contig size was 23.24 kbp, and 
the average gap size was 2.0 kbp where each 
distribution of sizes was exponential. As 
such, averages tend to be underrepresentative 
of the majority of the data. Figure 5 shows a 
histogram of the bases in scaffolds of various 
size ranges. Consider also that more than 
49% of all gaps were <500 bp long, more 
than 62% of all gaps were <1 kbp, and all 
gaps are < 100 kbp long. Similarly, more than 
73% of the sequence is in contigs >30 kbp,
more than 49% is in contigs > 100 kbp, and 
the largest contig was 1.99 Mbp long. Table 3 
provides summary statistics for the structure 
of this assembly with a direct comparison to 
the WGA assembly. 
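
The averages for the scaffolds spanning more than 100 kbp follow from the counts and totals in this paragraph. The sketch below is only a consistency check on the quoted figures; the variable names are illustrative.

```python
# Figures quoted for scaffolds spanning >100 kbp.
scaffolds = 1940
contigs = 107_199
gaps = 105_264
sequence_bp = 2.492e9       # total sequence in these scaffolds
sequence_fraction = 0.922   # 92.2% sequence, 7.8% gaps

# Total span is sequence plus gaps.
span_bp = sequence_bp / sequence_fraction
gap_bp = span_bp - sequence_bp

avg_scaffold_mbp = span_bp / scaffolds / 1e6
avg_contig_kbp = sequence_bp / contigs / 1e3
avg_gap_kbp = gap_bp / gaps / 1e3

print(f"average scaffold: {avg_scaffold_mbp:.2f} Mbp")
print(f"average contig:   {avg_contig_kbp:.2f} kbp")
print(f"average gap:      {avg_gap_kbp:.2f} kbp")
```

The results agree with the quoted averages to within rounding.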

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the hu- 
man genome via independent computational 
processes (WGA and CSA), we compared
scaffolds from the two assemblies as another 
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were 
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not covered by a matching segment in the 
other assembly. Some 82.5 Mbp of the WGA 
(3.95%) was not covered by the CSA, where- 
as 204.5 Mbp (8.26%) of the CSA was not 
covered by the WGA. This estimate did not 
require any consistency of the assemblies or 
any uniqueness of the matching segments. 
Thus, another analysis was conducted in 
which matches of less than 1 kbp between a 
pair of scaffolds were excluded unless they 
were confirmed by other matches having a 
consistent order and orientation. This gives 
some measure of consistent coverage: 1.982 
Gbp (95.00%) of the WGA is covered by the 
CSA, and 2.169 Gbp (87.69%) of the CSA is 
covered by the WGA by this more stringent 
measure.
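
The coverage percentages above can be reproduced from the reference scaffold totals given earlier (2.087 Gbp for WGA and 2.474 Gbp for CSA). The sketch below is a check on those figures only; the names are illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Totals of the reference scaffolds, in Mbp.
wga_total = 2087.0
csa_total = 2474.0

# First estimate: bases not covered by any matching segment.
wga_uncovered = percent(82.5, wga_total)
csa_uncovered = percent(204.5, csa_total)

# Second, more stringent estimate: consistently covered bases.
wga_consistent = percent(1982.0, wga_total)
csa_consistent = percent(2169.0, csa_total)

print(f"WGA not covered by CSA: {wga_uncovered:.2f}%")
print(f"CSA not covered by WGA: {csa_uncovered:.2f}%")
print(f"WGA consistently covered: {wga_consistent:.2f}%")
print(f"CSA consistently covered: {csa_consistent:.2f}%")
```

Small differences from the quoted percentages are expected, because the totals in the text are themselves rounded.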

The comparison of WGA to CSA also 
permitted evaluation of scaffolds for structur- 
al inconsistencies. We looked for instances in 
which a large section of a scaffold from one 
assembly matched only one scaffold from the 
other assembly, but failed to match over the 
full length of the overlap implied by the 
matching segments. An initial set of candi- 
dates was identified automatically, and then 
each candidate was inspected by hand. From 
this process, we identified 31 instances in 
which the assemblies appear to disagree in a 
nonlocal fashion. These cases are being further
evaluated to determine which assembly
is in error and why. 

In addition, we evaluated local inconsis- 
tencies of order or orientation. The following 
results exclude cases in which one contig in
one assembly corresponds to more than one 
overlapping contig in the other assembly (as 
long as the order and orientation of the latter 
agrees with the positions they match in the 
former). Most of these small rearrangements 
involved segments on the order of hundreds 
of base pairs and rarely > 1 kbp. We found a 
total of 295 kbp (0.012%) in the CSA assem- 
blies that were locally inconsistent with the 
WGA assemblies, whereas 2.108 Mbp
(0.11%) in the WGA assembly were incon- 
sistent with the CSA assembly. 






The CSA assembly was a few percentage 
points better in terms of coverage and slightly 
more consistent than the WGA, because it 
was in effect performing a few thousand shot- 
gun assemblies of megabase-sized problems, 
whereas the WGA is performing a shotgun 
assembly of a gigabase-sized problem. When
one considers the increase of two-and-a-half 
orders of magnitude in problem size, the in- 
formation loss between the two is remarkably 
small. Because CSA was logistically easier to 
deliver and the better of the two results avail- 
able at the time when downstream analyses 
needed to be begun, all subsequent analysis 
was performed on this assembly. 

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data be- 
tween the scaffolds. We next mapped the scaf- 
fold groups onto the chromosome using physi- 
cal mapping data. This step depends on having 
reliable high-resolution map information such 
that each scaffold will overlap multiple mark- 
ers. There are two genome-wide types of map 
information available: high-density STS maps 
and fingerprint maps of BAC clones developed 
at Washington University (45). Among the
genome-wide STS maps, GeneMap99 (GM99)
has the most markers and therefore was most 
useful for mapping scaffolds. The two different 
mapping approaches are complementary to one 
another. The fingerprint maps should have bet- 
ter local order because they were built by com- 
parison of overlapping BAC clones. On the 
other hand, GM99 should have a more reliable 
long-range order, because the framework mark- 
ers were derived from well-validated genetic 
maps. Both types of maps were used as a 
reference for human curation of the compo- 
nents that were the input to the regional assem- 
bly, but they did not determine the order of 
sequences produced by the assembler. 
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In order to determine the effectiveness of 
the fingerprint maps and GM99 for mapping 
scaffolds, we first examined the reliability of 
these maps by comparison with large scaffolds.
Only 1% of the STS markers on the 10
largest scaffolds (those >9 Mbp) were 
mapped on a different chromosome on 
GM99. Two percent of the STS markers disagreed
in position by more than five framework
bins. However, for the fingerprint
maps, a 2% chromosome discrepancy was 
observed, and on average 23.8% of BAC 
locations in the scaffold sequence disagreed
with fingerprint map placement by more than 
five BACs. When further examining the 
source of discrepancy, it was found that most 
of the discrepancy came from 4 of the 10 
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds. 
All four scaffolds were assembled as well as
the other six, as judged by clone coverage 
analysis, and showed the same low discrep- 
ancy rate to GM99, and thus we concluded 
that the fingerprint map global order in these 
cases was not reliable. Smaller scaffolds had 
a higher discordance rate with GM99 (4.21% 
of STSs were discordant by more than five 
framework bins), but a lower discordance rate 
with the fingerprint maps (11% of BACs 
disagreed with fingerprint maps by more than 
five BACs). This observation agrees with the 
clone coverage analysis (46) that Celera scaffold
construction was better supported by
long-range mate pairs in larger scaffolds than 
in small scaffolds. 

We created two orderings of Celera scaf- 
folds on the basis of the markers (BAC or 
STS) on these maps. Where the order of 
scaffolds agreed between GM99 and the 
WashU BAC map, we had a high degree of 
confidence that that order was correct; these 
scaffolds were termed "anchor scaffolds." 
Only scaffolds with a low overall discrepancy 
rate with both maps were considered anchor 
scaffolds. Scaffolds in GM99 bins were al- 
lowed to permute in their order to match 
WashU ordering, provided they did not vio- 
late their framework orders. Orientation of 
individual scaffolds was determined by the 
presence of multiple mapped markers with
consistent order. Scaffolds with only one 
marker have insufficient information to assign
orientation. We found 70.1% of the genome
in anchored scaffolds, more than 99%
of which are also oriented (Table 4). Because
GM99 is of lower resolution than the WashU 
map, a number of scaffolds without STS 
matches could be ordered relative to the an- 
chored scaffolds because they included se- 
quence from the same or adjacent BACs on 
the WashU map. On the other hand, because 
of occasional WashU global ordering dis- 
crepancies, a number of scaffolds determined 
to be "unmappable" on the WashU map could
be ordered relative to the anchored scaffolds 
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with GM99. These scaffolds were termed 
"ordered scaffolds." We found that 13.9% of
the assembly could be ordered by these ad- 
ditional methods, and thus 84.0% of the ge- 
nome was ordered unambiguously. 

Next, all scaffolds that could be placed, 
but not ordered, between anchors were as- 
signed to the interval between the anchored 
scaffolds and were deemed to be "bound- 
ed" between them. For example, small scaf- 
folds having STS hits from the same Gene- 
Map bin or hitting the same BAC cannot be 
ordered relative to each other, but can be 
assigned a placement boundary relative to 
other anchored or ordered scaffolds. The 
remaining scaffolds either had no localiza- 
tion information, conflicting information, 
or could only be assigned to a generic 
chromosome location. Using the above approaches,
~98% of the genome was anchored,
ordered, or bounded.
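
The four confidence levels can be summarized as a small decision rule. This is a simplified sketch of the categories as described in the text, not the curation procedure itself; the function name and the three boolean inputs are hypothetical.

```python
def classify_scaffold(ordered_by_gm99, ordered_by_washu, interval_known):
    """Assign a mapping category from three simplified yes/no inputs.

    ordered_by_gm99 / ordered_by_washu: the map gives a usable order.
    interval_known: the scaffold can be placed between two anchors.
    """
    if ordered_by_gm99 and ordered_by_washu:
        return "anchored"   # both maps agree on the order
    if ordered_by_gm99 or ordered_by_washu:
        return "ordered"    # ordered relative to anchors by one map
    if interval_known:
        return "bounded"    # placed between anchors, but not ordered
    return "unmapped"


# Fractions of the genome quoted in the text.
anchored_pct = 70.1
ordered_pct = 13.9
print(f"ordered unambiguously: {anchored_pct + ordered_pct:.1f}%")
```

The final line reproduces the 84.0% of the genome that was ordered unambiguously.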

Finally, we assigned a location for each 
scaffold placed on the chromosome by 
spreading out the scaffolds per chromosome. 
We assumed that the remaining unmapped 
scaffolds, constituting 2% of the genome,
were distributed evenly across the genome. 
By dividing the sum of unmapped scaffold
lengths by the number of
mapped scaffolds, we arrived at an estimate
of the interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each 
chromosome and to assign an offset in the 
chromosome. 
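
The gap estimate and the resulting chromosome offsets can be sketched as follows. The function names and the toy scaffold lengths are hypothetical; only the 1483-bp gap comes from the text.

```python
def interscaffold_gap(unmapped_lengths, n_mapped):
    """Spread the unmapped sequence evenly over the mapped scaffolds."""
    return sum(unmapped_lengths) // n_mapped


def assign_offsets(scaffold_lengths, gap=1483):
    """Return the start offset of each scaffold along one chromosome.

    Scaffolds are laid out in order and separated by a fixed gap.
    """
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


print(interscaffold_gap([3000, 1449], 3))
print(assign_offsets([1000, 2000, 500]))
```

With the toy inputs, the estimated gap is 1483 bp and the three scaffolds start at offsets 0, 2483, and 5966.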

During the scaffold-mapping effort, we en- 
countered many problems that resulted in addi- 
tional quality assessment and validation analy- 
sis. At least 978 (3% of 33,173) BACs were 
believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to 
unique positions within the CSA assembly and
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly. 

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult. 
2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and 
correctness (the structural accuracy of the 
order and orientation and the consensus se- 
quence of the assembly). 

Completeness. Completeness is defined as
the percentage of the euchromatic sequence 
represented in the assembly. This cannot be 
known with absolute certainty until the
euchromatin sequence has been completed.
However, it is possible to estimate complete- 
ness on the basis of (i) the estimated sizes of 
intrascaffold gaps; (ii) coverage of the two 
published chromosomes, 21 and 22 (48, 49); 
and (iii) analysis of the percentage of an 
independent set of random sequences (STS 
markers) contained in the assembly. The
whole-genome libraries contain heterochro- 
matic sequence and, although no attempt has 
been made to assemble it, there may be in- 
stances of unique sequence embedded in re- 
gions of heterochromatin as were observed in 
Drosophila (50, 51). 

The sequences of human chromosomes 21 
and 22 have been completed to high quality 
and published (48, 49). Although this sequence
served as input to the assembler, the
finished sequence was shredded into a shotgun
data set so that the assembler had the
opportunity to assemble it differently from 
the original sequence in the case of structural 
polymorphisms or assembly errors in the
BAC data. In particular, the assembler must 
be able to resolve repetitive elements at the
scale of components (generally multimegabase
in size), and so this comparison reveals
the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble 
"finished" sequence differently on the basis 
of Celera data resulted in an assembly with 
more segments than the chromosome 21 and 
22 sequences. We examined the reasons why 
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by
RepeatMasker; more than half of the remainder
was lower copy number repeat elements. 
A more global way of assessing completeness
is to measure the content of an independent
set of sequence data in the assembly. We compared
48,938 STS markers from GeneMap99
(57) to the scaffolds. Because these markers 
were not used in the assembly processes, they 
provided a truly independent measure of com- 
pleteness. ePCR (53) and BLAST (54) were 
used to locate STSs on the assembled genome. 
We found 44,524 (91%) of the STSs in the 
mapped genome. An additional 2648 markers 
(5.4%) were found by searching the unassembled
data or "chaff." We identified 1283
STS markers (2.6%) not found in either Celera 
sequence or BAC data as of September 2000, 
raising the possibility that these markers may 
not be of human origin. If that were the case, 
the Celera assembled sequence would represent 
93.4% of the human genome and the unas- 
sembled data 5.5%, for a total of 98.9% cover- 
age. Similarly, we compared CSA against 
36,678 TNG radiation hybrid markers (55a) 
using the same method. We found that 32,371
markers (88%) were located in the mapped 
CSA scaffolds, with 2055 markers (5.6%) 
found in the remainder. This gave a 94% cov- 
erage of the genome through another genome- 
wide survey. 
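
The marker-based completeness estimates can be recomputed from the counts in this paragraph. The sketch below is a check on the quoted figures; the variable names are illustrative.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# GeneMap99 STS markers (counts quoted in the text).
gm99_total = 48_938
gm99_in_assembly = 44_524
gm99_in_chaff = 2_648
gm99_not_found = 1_283   # absent from both Celera and BAC data

found_pct = percent(gm99_in_assembly, gm99_total)
chaff_pct = percent(gm99_in_chaff, gm99_total)

# If the markers found nowhere are not human, drop them from the denominator.
gm99_human = gm99_total - gm99_not_found
adj_found_pct = percent(gm99_in_assembly, gm99_human)
adj_chaff_pct = percent(gm99_in_chaff, gm99_human)

# TNG radiation hybrid markers.
tng_total = 36_678
tng_mapped = 32_371
tng_remainder = 2_055
tng_pct = percent(tng_mapped + tng_remainder, tng_total)

print(f"GM99 in assembly: {found_pct:.1f}%, in chaff: {chaff_pct:.1f}%")
print(f"adjusted: {adj_found_pct:.1f}% + {adj_chaff_pct:.1f}%")
print(f"TNG coverage: {tng_pct:.1f}%")
```

The recomputed values are close to the percentages quoted above; small differences reflect rounding in the text.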

Table 4. Summary of scaffold mapping. Scaffolds
were mapped to the genome with different levels
of confidence (anchored scaffolds have the highest
confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by
the WashU BAC map and GM99. Ordered scaffolds
were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or
component tiling path. Bounded scaffolds had order
conflicts between at least two of the external
maps, but their placements were adjacent to a
neighboring anchored or ordered scaffold. Unmapped
scaffolds had, at most, a chromosome
assignment. The scaffold subcategories are given
below each category.

Correctness. Correctness is defined as the
structural and sequence accuracy of the assembly.
Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the assembly
against other finished sequence for
determining sequencing accuracy at the nu- 
cleotide level, although this has been done for 
identifying polymorphisms as described in 
Section 6. The accuracy of the consensus 
sequence is at least 99.96% on the basis of a 
statistical estimate derived from the quality 
values of the underlying reads. 

The structural consistency of the assembly 
can be measured by mate-pair analysis. In a 
correct assembly, every mated pair of se- 
quencing reads should be located on the con- 
sensus sequence with the correct separation 
and orientation between the pairs. A pair is 
termed "valid" when the reads are in the
correct orientation and the distance between 
them is within the mean ± 3 standard devi- 
ations of the distribution of insert sizes of the 
library from which the pair was sampled. A 
pair is termed "misoriented" when the reads 
are not correctly oriented, and is termed "mis- 
separated" when the distance between the 
reads is not in the correct range but the reads 
are correctly oriented. The mean ± the stan- 
dard deviation of each library used by the 
assembler was determined as described 
above. To validate these, we examined all 
reads mapped to the finished sequence of 
chromosome 21 (48) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chimerism
(two different segments of the genome
cloned into the same plasmid), and how
tight the distribution of insert sizes was for 



those that were correct (Table 5). The stan- 
dard deviations for all Celera libraries were 
quite small, less than 15% of the insert 
length, with the exception of a few 50-kbp 
libraries. The 2- and 10-kbp libraries con- 
tained less than 2% invalid mate pairs, where- 
as the 50-kbp libraries were somewhat higher 
(~10%). Thus, although the mate-pair information
was not perfect, its accuracy was such
that measuring valid, misoriented, and misseparated
pairs with respect to a given assembly
was deemed to be a reliable instrument
for validation purposes, especially when sev- 
eral mate pairs confirm or deny an ordering. 
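
The three mate-pair categories defined above can be expressed as a short function. This is a sketch under stated assumptions: reads are given as hypothetical (start, end, strand) tuples on the consensus, "correct orientation" is taken to mean the leftmost read on the forward strand facing the rightmost read on the reverse strand, and the function name is illustrative.

```python
def classify_mate_pair(read_a, read_b, mean, sd, k=3.0):
    """Classify a mate pair as valid, misoriented, or misseparated.

    Each read is (start, end, strand) with strand "+" or "-".
    mean and sd describe the insert-size distribution of the library.
    """
    left, right = sorted([read_a, read_b], key=lambda r: r[0])
    # Assumed convention: mates must face each other.
    if not (left[2] == "+" and right[2] == "-"):
        return "misoriented"
    # Separation measured from the outer ends of the two reads.
    separation = right[1] - left[0]
    if abs(separation - mean) <= k * sd:
        return "valid"
    return "misseparated"


print(classify_mate_pair((100, 600, "+"), (1600, 2100, "-"), mean=2000, sd=200))
```

The default `k=3.0` corresponds to the mean ± 3 standard deviations criterion given in the text.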

The clone coverage of the genome was 
39X, meaning that any given base pair was,
on average, contained in 39 clones or, equiv- 
alently, spanned by 39 mate-paired reads. 
Areas of low clone coverage or areas with a 
high proportion of invalid mate pairs would 
indicate potential assembly problems. We 
computed the coverage of each base in the 
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length, 
less than 1% of the Celera assembly was in 
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported 
by this measure alone. 
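
Per-base clone coverage can be computed efficiently with a difference array. The sketch below illustrates the idea on toy data; the function names and the half-open interval convention are assumptions made for the example.

```python
def clone_coverage(length, clones):
    """Return the number of clones spanning each base.

    clones: list of (start, end) half-open intervals, one per valid
    mate pair, covering bases start .. end-1.
    """
    diff = [0] * (length + 1)
    for start, end in clones:
        diff[start] += 1   # coverage rises where a clone begins
        diff[end] -= 1     # and falls just past where it ends
    coverage = []
    depth = 0
    for i in range(length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def fraction_below(coverage, threshold=3):
    """Fraction of bases with clone coverage below the threshold."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)


cov = clone_coverage(10, [(0, 5), (2, 8), (4, 10)])
print(cov, fraction_below(cov))
```

Regions where the coverage drops below the threshold are the ones that would be flagged as potential assembly problems.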

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of 



5 September 2000 (50, 55b). In this latter 
case, Celera mate pairs had to be mapped to
the PFP assembly. To avoid mapping errors 
due to high-fidelity repeats, the only pairs 
mapped were those for which both reads 
matched at only one location with less than 
6% differences. A threshold was set such that 
sets of five or more simultaneously invalid 
mate pairs indicated a potential breakpoint, 
where the construction of the two assemblies 
differed. The graphic comparison of the CSA 
chromosome 21 assembly with the published 
sequence (Fig. 6A) serves as a validation of 
this methodology. Blue tick marks in the 
panels indicate breakpoints. There were a 
similar (small) number of breakpoints on 
both chromosome sequences. The exception 
was 12 sets of scaffolds in the Celera assem- 
bly (a total of 3% of the chromosome length 
in 212 single-contig scaffolds) that were 
mapped to the wrong positions because they 
were too small to be mapped reliably. Figures 
6 and 7 and Table 6 illustrate the mate-pair 
differences and breakpoints between the two 
assemblies. There was a higher percentage of 
misoriented and misseparated mate pairs in 
the large-insert libraries (50 kbp and BAC 
ends) than in the small-insert libraries in both
assemblies (Table 6). The large-insert librar- 
ies are more likely to identify discrepancies 
simply because they span a larger segment of 
the genome. The graphic comparison between
the two assemblies for chromosome 8
(Fig. 6, B and C) shows that there are many 



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orientation
or placement, they were considered invalid (number of invalid mate
pairs).
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gene boundaries. During this process, multiple 
hits to the same region were collapsed to a 
coherent set of data by tracking the coverage of 
a region. For example, if a group of bases was 
represented by multiple overlapping ESTs, the 
union of these regions matched by the set of 
ESTs on the scaffold was marked as being 
supported by EST evidence. This resulted in a 
series of "gene bins," each of which was be- 
lieved to contain a single gene. One weakness of 
this initial implementation of the algorithm was 
in predicting gene boundaries in regions of tan- 
demly duplicated genes. Gene clusters frequent- 
ly resulted in homologous neighboring genes 



being joined together, resulting in an annotation 
that artificially concatenated these gene models. 
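
Collapsing overlapping hits into the union of covered regions is a standard interval-merging step. The sketch below shows one common way to do it; the function name and the toy coordinates are hypothetical and the original pipeline may differ.

```python
def merge_intervals(hits):
    """Merge overlapping (start, end) hits into their union.

    Returns a sorted list of non-overlapping (start, end) tuples.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


est_hits = [(100, 200), (150, 300), (400, 500)]
print(merge_intervals(est_hits))
```

Here the first two ESTs overlap and are reported as a single supported region, while the third stays separate.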

Next, known genes (those with exact match- 
es of a full-length cDNA sequence to the ge- 
nome) were identified, and the region corre- 
sponding to the cDNA was annotated as a 
predicted transcript. A subset of the curated
human gene set RefSeq from the National
Center for Biotechnology Information
(NCBI) was included as a data set searched in 
the computational pipeline. If a RefSeq tran- 
script matched the genome assembly for at least 
50% of its length at >92% identity, then the 
SIM4 (63) alignment of the RefSeq transcript to 



the region of the genome under analysis was 
promoted to the status of an Otto annotation. 
Because the genome sequence has gaps and 
sequence errors such as frameshifts, it was not 
always possible to predict a transcript that 
agrees precisely with the experimentally deter- 
mined cDNA sequence. A total of 6538 genes 
in our inventory were identified and transcripts 
predicted in this way. 
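
The promotion rule for RefSeq alignments reduces to two thresholds. The sketch below encodes them directly; the function and parameter names are hypothetical, and how the matched length and identity are measured is not specified here.

```python
def promote_refseq(matched_bases, transcript_length, identity,
                   min_fraction=0.50, min_identity=0.92):
    """Return True if a RefSeq alignment qualifies as an Otto annotation.

    The transcript must match over at least 50% of its length
    at more than 92% identity (thresholds from the text).
    """
    fraction = matched_bases / transcript_length
    return fraction >= min_fraction and identity > min_identity


print(promote_refseq(600, 1000, 0.95))
```

A transcript matching over 60% of its length at 95% identity passes; one matching over 40% of its length does not, whatever its identity.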

Regions that have a substantial amount of 
sequence similarity, but do not match known 
genes, were analyzed by that part of the Otto 
system that uses the sequence similarity in- 
formation to predict a transcript. Here, Otto 
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Fig. 6. Comparison of the CSA and the PFP assembly. 
(A) All of chromosome 21, (B) all of chromosome 8, 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between 
the mates) for each assembly grouped by library size. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.
evaluates evidence generated by the computational
pipeline, corresponding to conservation
between mouse and human genomic
DNA, similarity to human transcripts (ESTs
and cDNAs), similarity to rodent transcripts
(ESTs and cDNAs), and similarity of the
translation of human genomic DNA to known
proteins to predict potential genes in the human
genome. The sequence from the region
of genomic DNA contained in a gene bin was 
extracted, and the subsequences supported by 
any homology evidence were marked (plus 100 
Fig. 7. Schematic view of the distribution of breakpoints and large gaps
on all chromosomes. For each chromosome, the upper pair of lines
represent the PFP assembly, and the lower pair of lines represent Celera's
assembly. Blue tick marks represent breakpoints, whereas red tick marks
represent a gap of larger than 10,000 bp. The number of breakpoints per
chromosome is indicated in black, and the chromosome numbers in red.
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bases flanking these regions). The other bases 
in the region, those not covered by any homol- 
ogy evidence, were replaced by N's. This se- 
quence segment, with high confidence regions 
represented by the consensus genomic se- 
quence and the remainder represented by N's, 
was then evaluated by Genscan to see if a 
consistent gene model could be generated. This 
procedure simplified the gene-prediction task 
by first establishing the boundary for the gene 
(not a strength of most gene-finding algo- 
rithms), and by eliminating regions with no 
supporting evidence. If Genscan returned a 
plausible gene model, it was further evaluated 
before being promoted to an "Otto" annotation. 
The final Genscan predictions were often quite 
different from the prediction that Genscan returned
on the same region of native genomic
sequence. A weakness of using Genscan to 
refine the gene model is the loss of valid, small 
exons from the final annotation. 
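
The masking step described above can be sketched in Python. This is a minimal illustration rather than the pipeline's actual code: the function name `mask_unsupported` and the representation of homology evidence as half-open `(start, end)` intervals are assumptions made for the example, and only the 100-base flank is taken from the text.

```python
def mask_unsupported(seq, evidence, flank=100):
    """Replace bases not covered by evidence (plus flanks) with 'N'.

    seq      -- genomic sequence of the gene bin (string)
    evidence -- iterable of half-open (start, end) intervals (assumed format)
    flank    -- number of flanking bases kept on each side of an interval
    """
    keep = [False] * len(seq)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    # Supported bases keep the consensus sequence; all others become N.
    return "".join(base if k else "N" for base, k in zip(seq, keep))
```

The masked string would then be handed to the gene finder in place of the native genomic sequence.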

The next step in defining gene structures 
based on sequence similarity was to compare 
each predicted transcript with the homology- 
based evidence that was used in previous steps 
to evaluate the depth of evidence for each exon 
in the prediction. Internal exons were consid- 
ered to be supported if they were covered by 
homology evidence to within ±10 bases of 
their edges. For first and last exons, the internal 
edge was required to be within 10 bases, but the
external edge was allowed greater latitude to 
allow for 5' and 3' untranslated regions 
(UTRs). To be retained, a prediction for a 
multi-exon gene must have evidence such that 
the total number of "hits," as defined above, 
divided by the number of exons in the predic- 
tion must be >0.66 or must correspond to a 
RefSeq sequence. A single-exon gene must be 
covered by at least three supporting hits (±10 
bases on each side), and these must cover the 
complete predicted open reading frame. For 
a single-exon gene, we also required that 
the Genscan prediction include both a start 
and a stop codon. Gene models that did not 
meet these criteria were disregarded, and 

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are significant
(Tukey HSD; P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*       0.939         0.973
Otto (homology)†          0.604         0.884
Genscan                   0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an
evidence-based Genscan prediction.
†Refers to those annotations produced by supplying all
available evidence to Genscan.




those that passed were promoted to Otto 
predictions. Homology-based Otto predic- 
tions do not contain 3' and 5' untranslated 
sequence. Although three de novo gene-finding 
programs [GRAIL, Genscan, and FgenesH 
(63)] were run as part of the computational 
analysis, the results of these programs were not 
directly used in making the Otto predictions. 
Otto predicted 11,226 additional genes by 
means of sequence similarity. 
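
The retention rule for multi-exon predictions can be sketched as follows. This is one possible reading of the criteria, not the Otto implementation: it assumes that a "hit" means one supported exon, that support means an overlapping evidence interval whose edges fall within 10 bases of the checked exon edges, and that exons are listed in transcript order on the forward strand. The function names are hypothetical.

```python
def exon_is_supported(exon, evidence, tol=10, check_start=True, check_end=True):
    """True if some evidence interval overlaps the exon and matches the
    checked edges to within `tol` bases."""
    start, end = exon
    for ev_start, ev_end in evidence:
        overlaps = ev_start < end and ev_end > start
        start_ok = (not check_start) or abs(ev_start - start) <= tol
        end_ok = (not check_end) or abs(ev_end - end) <= tol
        if overlaps and start_ok and end_ok:
            return True
    return False


def retain_multi_exon(exons, evidence, is_refseq=False, tol=10, threshold=0.66):
    """Apply the >0.66 supported-exon rule to a multi-exon prediction."""
    n = len(exons)
    if n < 2:
        raise ValueError("expects a multi-exon prediction")
    hits = 0
    for i, exon in enumerate(exons):
        # External edges of the first and last exons are not checked,
        # which leaves latitude for 5' and 3' UTRs.
        if exon_is_supported(exon, evidence, tol,
                             check_start=(i != 0), check_end=(i != n - 1)):
            hits += 1
    return is_refseq or hits / n > threshold
```

Under these assumptions a three-exon prediction with two supported exons passes (2/3 > 0.66), while one with a single supported exon does not unless it corresponds to a RefSeq sequence.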

3.2 Otto validation 

To validate the Otto homology-based process
and the method that Otto uses to define the
structures of known genes, we compared transcripts
predicted by Otto with their corresponding
(and presumably correct) transcript from a
set of 4512 RefSeq transcripts for which there
was a unique Sim4 alignment (Table 7). In
order to evaluate the relative performance of
Otto and Genscan, we made three comparisons.
The first involved a determination of the accuracy
of gene models predicted by Otto with
only homology data other than the corresponding
RefSeq sequence (Otto-homology in Table
7). We measured the sensitivity (correctly predicted
bases divided by the total length of the
cDNA) and specificity (correctly predicted
bases divided by the sum of the correctly and
incorrectly predicted bases). Second, we examined
the sensitivity and specificity of the Otto
predictions that were made solely with the RefSeq
sequence, which is the process that Otto
uses to annotate known genes (Otto-RefSeq).
And third, we determined the accuracy of the
Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment
method (Otto-RefSeq) was the most accurate,
and Otto-homology performed better than Genscan
by both criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides
in the Otto-RefSeq transcripts were not contained
in the original RefSeq transcripts. The
discrepancies could come from legitimate
differences between the Celera assembly
and the RefSeq transcript due to polymorphisms,
incomplete or incorrect data in the
Celera assembly, errors introduced by Sim4
during the alignment process, or the presence
of alternatively spliced forms in the
data set used for the comparisons.
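
The two measures can be written out as a short calculation. The sketch below follows the definitions given in the text and the Table 7 caption; the function name and the example lengths are illustrative assumptions, not data from the study.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Return (sensitivity, specificity) for one predicted transcript.

    n_aligned         -- number of uniquely aligned RefSeq bases (N)
    refseq_length     -- length of the published RefSeq transcript
    prediction_length -- length of the predicted transcript
    """
    sensitivity = n_aligned / refseq_length
    specificity = n_aligned / prediction_length
    return sensitivity, specificity


# Hypothetical example: 939 of 1000 RefSeq bases recovered by a 965-base prediction.
sens, spec = sensitivity_specificity(939, 1000, 965)

# The missed and extra fractions quoted in the text are the complements of
# the Otto (RefSeq only) values in Table 7.
missed_pct = round((1 - 0.939) * 100, 1)   # RefSeq bases absent from the annotation
extra_pct = round((1 - 0.973) * 100, 1)    # annotated bases absent from RefSeq
```

With the Table 7 values, the complements come to 6.1% and 2.7%, matching the figures quoted above.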

Because Otto uses an evidence-based approach
to reconstruct genes, the absence of
experimental evidence for intervening exons
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split genes"
when in fact all the evidence should be combined
into a single transcript. We also examined
the tendency of these methods to incorrectly
split gene predictions. These trends are shown
in Fig. 8. Both RefSeq and homology-based
predictions by Otto split known genes into fewer
segments than Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite 
conservative, we used a different gene-pre- 
diction strategy in regions where the ho- 
mology evidence was less strong. Here the 
results of de novo gene predictions were 
used. For these genes, we insisted that a 
predicted transcript have at least two of the 
following types of evidence to be included 
in the gene set for further analysis: protein, 
human EST, rodent EST, or mouse genome 
fragment matches. This final class of pre- 
dicted genes is a subset of the predictions 
made by the three gene-finding programs 
that were used in the computational pipeline.
For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs resulted
in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping
with one another). Of these,
57,935 did not overlap known genes or 
predictions made by Otto. Only 21,350 of 
the gene predictions that did not overlap 
Otto predictions were partially supported 
by at least one type of sequence similarity 
evidence, and 8619 were partially support- 
ed by two types of evidence (Table 8). 

The sum of this number (21,350) and the 
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement
for other supporting evidence is
made more stringent, this number drops rapidly
so that demanding two types of evidence
reduces the total gene number to 26,383 and 
demanding three types reduces it to ~23,000. 
Requiring that a prediction be supported by 
all four categories of evidence is too stringent 
because it would eliminate genes that encode 
novel proteins (members of currently unde- 
scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 

In a further attempt to identify genes that 
were not found by the autoannotation process 
or any of the de novo gene finders, we ex- 
amined regions outside of gene predictions 
that were similar to the EST sequence, and 
where the EST matched the genomic se- 
quence across a splice junction. After correct- 
ing for potential 3' UTRs of predicted genes, 
about 2500 such regions remained. Addition 
of a requirement for at least one of the following
evidence types (homology to mouse
genomic sequence fragments, rodent ESTs,
or cDNAs) or similarity to a known protein
reduced this number to 1010. Adding this to 
the numbers from the previous paragraph 
would give us estimates of about 40,000, 
27,000, and 24,000 potential genes in the 
human genome, depending on the stringency 
of evidence considered. Table 8 illustrates the 
number of genes and presents the degree of 






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org 




confidence based on the supporting evidence. 
Transcripts encoded by a set of 26,383 genes 
were assembled for further analysis. This set 
includes the 6538 genes predicted by Otto on 
the basis of matches to known genes, 11,226
transcripts predicted by Otto based on homol- 
ogy evidence, and 8619 from the subset of 
transcripts from de novo gene-prediction pro- 
grams that have two types of supporting ev- 
idence. The 26,383 genes are illustrated along 
chromosome diagrams in Fig. 1. These are a 
very preliminary set of annotations and are 
subject to all the limitations of an automated 
process. Considerable refinement is still nec- 
essary to improve the accuracy of these tran- 
script predictions. All the predictions and 
descriptions of genes and the associated evi- 
dence that we present are the product of 
completely computational processes, not ex- 
pert curation. We have attempted to enumer- 
ate the genes in the human genome in such a 
way that we have different levels of confi- 
dence based on the amount of supporting 
evidence: known genes, genes with good pro- 
tein or EST homology evidence, and de novo 
gene predictions confirmed by modest ho- 
mology evidence. 
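
The gene-number estimates in this section follow from simple sums of the counts quoted in the text and in Table 8. The short Python sketch below reproduces that arithmetic; the variable names are illustrative, and the count for three lines of evidence (4,947) is taken from Table 8.

```python
# Counts quoted in the text and in Table 8.
otto_known = 6538        # Otto genes based on matches to known genes
otto_homology = 11226    # Otto genes based on homology evidence
otto_total = otto_known + otto_homology

# De novo predictions not overlapping Otto, by minimum number of evidence types.
de_novo = {1: 21350, 2: 8619, 3: 4947}

# EST-only regions with one additional type of supporting evidence.
est_regions = 1010

totals = {k: otto_total + n for k, n in de_novo.items()}
totals_with_est = {k: t + est_regions for k, t in totals.items()}
```

Here `totals` gives 39,114, 26,383, and 22,711 genes for one, two, and three types of evidence, and adding the 1010 EST-supported regions gives 40,124, 27,393, and 23,721, which the text rounds to about 40,000, 27,000, and 24,000.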

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typi- 
cal" gene in the human DNA sequence to 
be about 27,894 bases. This is based on the 
average span covered by RefSeq tran- 
scripts, used because it represents our high- 
est confidence set. 

The set of transcripts promoted to gene 
annotations varies in a number of ways. As 
can be seen from Table 8 and Fig. 9, tran- 
scripts predicted by Otto tend to be longer, 
having on average about 7.8 exons, whereas 
those promoted from gene-prediction programs
average about 3.7 exons. The largest
number of exons that we have identified in a 
transcript is 234 in the titin mRNA. Table 8 
compares the amounts of evidence that sup- 
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port the Otto and other predicted transcripts. 
For example, one can see that a typical Otto 
transcript has 6.99 of its 7.81 exons supported 
by protein homology evidence. As would be 
expected, the Otto transcripts generally have 
more support than do transcripts predicted by 
the de novo methods. 

4 Genome Structure 

Summary. This section describes several of
the noncoding attributes of the assembled
genome sequence and their correlations with 
the predicted gene set. These include an anal- 
ysis of G+C content and gene density in the 
context of cytogenetic maps of the genome, 
an enumerative analysis of CpG islands, and 
a brief description of the genome-wide repet- 
itive elements. 




4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the 
most visible, element of the structure of 
the genome is the banding pattern produced 
by Giemsa stain. Chromosomal banding 
studies have revealed that about 17% to 
20% of the human chromosome comple- 
ment consists of C-bands, or constitutive 
heterochromatin (64). Much of this hetero- 
chromatin is highly polymorphic and con- 
sists of different families of alpha satellite 
DNAs with various higher order repeat 
structures (65). Many chromosomes have 
complex inter- and intrachromosomal du- 
plications present in pericentromeric re- 
gions (66). About 5% of the sequence reads 
were identified as alpha satellite sequences; 
these were not included in the assembly. 
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Fig. 8. Analysis of split genes resulting from different annotation methods. A set of 4512 
Sim4-based alignments of RefSeq transcripts to the genomic assembly were chosen (see the text 
for criteria), and the numbers of overlapping Genscan, Otto (RefSeq only) annotations based solely 
on Sim4-polished RefSeq alignments, and Otto (homology) annotations (annotations produced by 
supplying all available evidence to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated with a single RefSeq
transcript. The zero class for the Otto-homology predictions shown here indicates that the
Otto-homology calls were made without recourse to the RefSeq transcript, and thus no Otto call 
was made because of insufficient evidence. 



Table 8. Numbers of exons and transcripts supported by various types of evidence for Otto and de novo gene prediction methods. Highlighted cells indicate
the gene sets analyzed in this paper (boldface, set of genes selected for protein analysis; italic, total set of accepted de novo predictions).

                                              Types of evidence                  No. of lines of evidence*
                                     Total    Mouse   Rodent  Protein    Human       ≥1       ≥2       ≥3       ≥4
Otto, number of transcripts         17,969   17,065   14,881   15,477   16,374  17,968†   17,501   15,877   12,451
Otto, number of exons              141,218  111,174   89,569  108,431  118,869  140,710  127,955   99,574   59,804
De novo, number of transcripts      58,032   14,463    5,094    8,043    9,220   21,350    8,619    4,947    1,904
De novo, number of exons           319,935   48,594   19,344   26,264   40,104   79,148   31,130   17,508    6,520
Exons per transcript, Otto            7.84     5.77     6.01     6.99     7.24     7.81     7.19     6.00     4.28
Exons per transcript, de novo         5.53     3.17     3.80     3.27     4.36     3.7      3.56     3.42     3.16

*Four kinds of evidence (conservation in mouse genomic DNA, similarity to human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known proteins) were
considered to support gene predictions from the different methods. The use of evidence is quite liberal, requiring only a partial match to a single exon of a predicted transcript.
†This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
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Examination of pericentromeric regions is 
ongoing. 

The remaining ~80% of the genome, the
euchromatic component, is divisible into G-,
R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide
composition and gene density, although we
have been unable to determine precise band
boundaries at the molecular level. T-bands are
the most G+C- and gene-rich, and G-bands are
G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular
level as long stretches of DNA of differing base
composition, termed isochores (denoted L, H1,
H2, and H3), which are >300 kbp in length
(69). Bernardi defined the L (light) isochores as
G+C-poor (<43%), whereas the H (heavy)
isochores fall into three G+C-rich classes representing
24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low
in the L isochores and 20-fold more enriched in
the H2 and H3 isochores (70). By examining
contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of
G+C content >48% (H3 isochores) averaged
273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged
202.8 kbp in length, and the average span
of regions with <43% (L isochores) was
1078.6 kbp. The correlation between G+C
content and gene density was also examined in
50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that
the density of genes was greater in regions of
high G+C than in regions of low G+C content,
as expected. However, the correlation between
G+C content and gene density was not as
skewed as previously predicted (69). A higher
proportion of genes were located in the G+C-poor
regions than had been expected.
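
The window-based classification can be illustrated with a short Python sketch. It assumes non-overlapping windows, ignores undetermined bases (N) when computing G+C content, and assigns a window at exactly 43% or 48% to the H1+H2 class; those choices and the function names are assumptions for the example, while the thresholds and the 50-kbp window size come from the text.

```python
def gc_fraction(seq):
    """G+C fraction of a sequence, ignoring non-ACGT characters."""
    seq = seq.upper()
    determined = sum(seq.count(base) for base in "ACGT")
    if determined == 0:
        return None
    return (seq.count("G") + seq.count("C")) / determined


def classify_window(gc):
    """Map a G+C fraction to the isochore classes used in the text."""
    if gc is None:
        return "undetermined"
    pct = gc * 100
    if pct > 48:
        return "H3"
    if pct >= 43:
        return "H1+H2"
    return "L"


def classify_windows(seq, window=50_000):
    """Classify contiguous, non-overlapping windows along a sequence."""
    return [classify_window(gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq), window)]
```

Runs of adjacent windows with the same label could then be merged to estimate the average span of each class.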

Chromosomes 17, 19, and 22, which have 
a disproportionate number of H3-containing 
bands, had the highest gene density (Table 
10). Conversely, of the chromosomes that we 
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found to have the lowest gene density, X, 4, 
18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3
bands, did not have a particularly low gene
density in our analysis. In addition, chromo- 
some 8, which we found to have a low gene 
density, does not appear to be unusual in its 
H3 banding. 

How valid is Ohno's postulate (71) that
mammalian genomes consist of oases of genes
in otherwise essentially empty deserts? It appears
that the human genome does indeed contain
deserts, or large, gene-poor regions. If we
define a desert as a region >500 kbp without a
gene, then we see that 605 Mbp, or about 20%
of the genome, is in deserts. These are not
uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22
have only about 12% of their collective 171
Mbp in deserts, whereas gene-poor chromosomes
4, 13, 18, and X have 27.5% of their 492
Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not necessarily
imply that they are devoid of biological
function.
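
The desert definition lends itself to a simple interval scan. The sketch below is illustrative only: it assumes genes are given as half-open `(start, end)` coordinates on one chromosome and that gene-free stretches at the chromosome ends also count as deserts, neither of which is stated in the text. The function name is hypothetical.

```python
def find_deserts(genes, chrom_length, min_size=500_000):
    """Return gene-free intervals longer than `min_size` bases.

    genes        -- list of half-open (start, end) gene coordinates
    chrom_length -- length of the chromosome in bases
    """
    deserts = []
    prev_end = 0
    for start, end in sorted(genes):
        if start - prev_end > min_size:
            deserts.append((prev_end, start))
        # Track the furthest gene end seen so far to handle overlapping genes.
        prev_end = max(prev_end, end)
    if chrom_length - prev_end > min_size:
        deserts.append((prev_end, chrom_length))
    return deserts


def desert_fraction(genes, chrom_length, min_size=500_000):
    """Fraction of the chromosome that lies in deserts."""
    total = sum(e - s for s, e in find_deserts(genes, chrom_length, min_size))
    return total / chrom_length
```

Summing the desert lengths over all chromosomes and dividing by the genome size gives the kind of genome-wide fraction quoted above.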

4.2 Linkage map 

Linkage maps provide the basis for genetic
analysis and are widely used in the study of the
inheritance of traits and in the positional cloning
of genes. The distance metric, centimorgans
(cM), is based on the recombination rate between
homologous chromosomes during meiosis.
In general, the rate of recombination in
females is greater than that in males, and this
degree of map expansion is not uniform across
the genome (72). One of the opportunities enabled
by a nearly complete genome sequence is
to produce the ultimate physical map, and to
fully analyze its correspondence with two other
maps that have been widely used in genome
and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop
between the mapping and sequencing phases of
the genome project.

Table 9. Characteristics of G+C in isochores.

We mapped the location of the markers
that constitute the Genethon linkage map to
the genome. The rate of recombination, expressed
as cM per Mbp, was calculated for
3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric
region of the chromosomes have been previously
documented (73). From this mapping
result, there is a difference of 4.99 between
lowest rates and highest rates and the largest
difference of 4.4 between males and females
(4.99 to 0.47 on chromosome 16). This indicates
that the variability in recombination
rates among regions of the genome exceeds
the differences in recombination rates between
males and females. The human genome
has recombination hotspots, where recombination
rates vary fivefold or more over
a space of 1 kbp, so the picture one gets of the
magnitude of variability in recombination
rate will depend on the size of the window



                         Fraction of genome        Fraction of genes
Isochore   G+C (%)     Predicted*   Observed     Predicted*   Observed
H3         >48              5          9.5           37         24.8
H1/H2      43-48           25         21.2           32         26.6
L          <43             67         69.2           31         48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.



Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20. [Bar chart: x-axis, number of exons per transcript (1 to >20); y-axis, 0 to 7000; series, number of Otto transcripts and number of de novo predictions with at least one line of evidence.]

examined. Unfortunately, too few meiotic
crossovers have occurred in Centre d'Etude 
du Polymorphism Humain (CEPH) and other 
reference families to provide a resolution any 
finer than about 3 Mbp. The next challenge 
will be to determine a sequence basis of 
recombination at the chromosomal level. An 
accurate predictor for the rate for variation in 
recombination rates between any pair of 
markers would be extremely useful in design- 
ing markers to narrow a region of linkage, 
such as in positional cloning projects. 
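
The windowed recombination rate can be sketched as follows. This is a simplified illustration, not the procedure used for Table 12: it assumes each marker is a `(physical position in bp, genetic position in cM)` pair on one chromosome, and it estimates the rate in each 3-Mbp window from the first and last marker that fall inside the window. The function name is hypothetical.

```python
from collections import defaultdict


def recombination_rates(markers, window=3_000_000):
    """Return {window index: cM per Mbp} for windows with usable markers.

    markers -- iterable of (position_bp, position_cM) pairs
    """
    bins = defaultdict(list)
    for pos, cm in markers:
        bins[pos // window].append((pos, cm))

    rates = {}
    for index, points in sorted(bins.items()):
        points.sort()
        (first_pos, first_cm), (last_pos, last_cm) = points[0], points[-1]
        # A rate needs two markers at distinct physical positions.
        if last_pos > first_pos:
            rates[index] = (last_cm - first_cm) / ((last_pos - first_pos) / 1e6)
    return rates
```

Taking the maximum, mean, and minimum of these per-window values for a chromosome would give summary columns of the kind reported in Table 12.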

4.3 Correlation between CpG islands
and genes

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79).

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer 
(75) used a computational method to iden- 
tify CpG islands and defined them as re- 
gions of DNA of >200 bp that have a G+C 
content of >50% and a ratio of observed 
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versus expected frequency of CG dinucle- 
otide >0.6. 
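
The computational definition can be expressed as a small check. The sketch below uses the commonly cited form of the observed/expected ratio, (number of CG dinucleotides × N) / (number of C × number of G) for a region of N bases; that exact formula is an assumption here, since the text gives only the thresholds. Function names are illustrative.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")          # observed CG dinucleotides
    expected = c * g / n          # expected count if C and G were independent
    ratio = cg / expected if expected else 0.0
    return (c + g) / n, ratio


def is_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the length, G+C, and observed/expected thresholds from the text."""
    gc, ratio = cpg_stats(seq)
    return len(seq) > min_length and gc > min_gc and ratio > min_ratio
```

A region rich in C and G but with few CG dinucleotides fails the ratio test even though it passes the G+C test.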

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG island
with gene starts, given a set of annotated
genomic transcripts and the whole genome
sequence. We have analyzed the publicly
available annotation of chromosome 22, as
well as using the entire human genome in 
our assembly and the computationally an- 
notated genes. A variation of the CpG is- 
land computation was compared with 
Larsen et al. (76). The main differences are 
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
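
The sliding-window variant described above can be sketched in Python. This is a reconstruction under stated assumptions rather than the code used for the analysis: the window advances one base at a time, a window qualifies when its G+C fraction exceeds 0.5 and its observed/expected CG ratio exceeds the threshold, and the ratio uses the common (CG count × N) / (C count × G count) form. All function names are hypothetical.

```python
def passes(seq, min_gc, min_ratio):
    """Check the G+C and observed/expected CG thresholds for one region."""
    n = len(seq)
    c, g, cg = seq.count("C"), seq.count("G"), seq.count("CG")
    if n == 0 or c == 0 or g == 0:
        return False
    return (c + g) / n > min_gc and cg * n / (c * g) > min_ratio


def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return half-open (start, end) intervals of candidate CpG islands."""
    seq = seq.upper()
    starts = [i for i in range(0, len(seq) - window + 1, step)
              if passes(seq[i:i + window], min_gc, min_ratio)]

    # Merge qualifying windows only when they overlap.
    merged = []
    for i in starts:
        if merged and i < merged[-1][1]:
            merged[-1][1] = i + window
        else:
            merged.append([i, i + window])

    # Recompute the statistics on each merged region and reject any
    # region that no longer meets the thresholds.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_gc, min_ratio)]
```

Raising `min_ratio` from 0.6 to 0.8 corresponds to the difference between method 1 and method 2 in the next paragraph.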

To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
otide likelihood ratio. Besides using the orig- 
inal threshold of 0.6 (method 1), we used a 
higher threshold of CG dinucleotide likelihood
ratio of 0.8 (method 2), which results in
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only 2.6% of the 
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a
CpG island. This is comparable to ratios reported
by others (82). The last two rows of
the table show the observed and expected
average distance, respectively, of the closest
CpG island from the first exon. The observed
average distances to the closest CpG island are smaller than
the corresponding expected distances, confirming
an association between CpG island
and the first exon.

We also looked at the distribution of CpG 
island nucleotides among various sequence 
classes such as intergenic regions, introns, 
exons, and first exons. We computed the 
likelihood score for each sequence class as 
the ratio of the observed fraction of CpG 
island nucleotides in that sequence class 
and the expected fraction of CpG island 
nucleotides in that sequence class. The re- 
sult of applying method 1 on CSA were 
scores of 0.89 for intergenic region, 1.2 for 
intron, 5.86 for exon, and 13.2 for first 
exon. The same trend was also found for 
chromosome 22 and after the application of 
a higher threshold (method 2) on both data 
sets. In sum, genome-wide analysis has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
first coding exons. 
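
The likelihood score can be written as a short calculation. The sketch assumes that the expected fraction for a sequence class is that class's share of all nucleotides, which is one natural reading of the definition but is not spelled out in the text. The function name and the base counts in the example are hypothetical and are not the values behind the scores reported above.

```python
def likelihood_scores(island_bases, class_bases):
    """Observed/expected share of CpG island nucleotides per sequence class.

    island_bases -- {class: CpG island nucleotides falling in that class}
    class_bases  -- {class: total nucleotides in that class}
    """
    total_island = sum(island_bases.values())
    total_bases = sum(class_bases.values())
    scores = {}
    for name, size in class_bases.items():
        observed = island_bases.get(name, 0) / total_island
        expected = size / total_bases   # assumed: proportional to class size
        scores[name] = observed / expected
    return scores


# Hypothetical counts, chosen only to show the calculation.
example = likelihood_scores(
    island_bases={"intergenic": 50, "intron": 30, "exon": 20},
    class_bases={"intergenic": 700, "intron": 280, "exon": 20},
)
```

A score above 1 indicates that CpG island nucleotides are enriched in that class relative to its size.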
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4.4 Genome-wide repetitive elements 

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very similar
to values reported previously (83). Repetitive
sequence may be underrepresented in
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 

Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes.

Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp
windows. The number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human
genome

Retrotransposition of processed mRNA
transcripts into the genome results in functional
genes, called intronless paralogs, or
inactivated genes (pseudogenes). A paralog
refers to a gene that appears in more than
one copy in a given organism as a result of
a duplication event. The existence of both
intron-containing and intronless forms of
genes encoding functionally similar or
identical proteins has been previously described
(84, 85). Cataloging these evolutionary
events on the genomic landscape is
of value in understanding the functional
consequences of such gene-duplication
events in cellular biology. Identification of
conserved intronless paralogs in the mouse
or other mammalian genomes should provide
the basis for capturing the evolutionary
chronology of these transposition
events and provide insights into gene loss
and accretion in the mammalian radiation.

A set of proteins corresponding to all 901
Otto-predicted, single-exon genes were subjected
to BLAST analysis against the proteins
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% sequence identity over 90% of the 
length, we identified 298 instances of single- 
to multi-exon correspondence. Of these 298 
sequences, 97 were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified 
and were verified by manual inspection. 
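
The homology criteria can be expressed as a simple filter over alignment results. The sketch below is illustrative: it assumes that "90% of the length" refers to the length of the single-exon protein, and the function name and argument layout are hypothetical and not tied to any particular BLAST output format.

```python
def meets_homology_criteria(identity, aligned_length, protein_length,
                            min_identity=0.70, min_coverage=0.90):
    """True if an alignment meets the identity and coverage thresholds.

    identity       -- fraction of identical residues in the alignment
    aligned_length -- number of residues of the single-exon protein aligned
    protein_length -- full length of the single-exon protein (assumed basis)
    """
    coverage = aligned_length / protein_length
    return identity >= min_identity and coverage >= min_coverage
```

Single-exon proteins with at least one multi-exon hit passing this filter would form the candidate set of single- to multi-exon correspondences.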

We believe that these 97 cases may represent
intronless paralogs (see Web table 1 on
Science Online at www.sciencemag.org/cgi/
content/full/291/5507/1304/DC1) of known
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (84, 86). We do not find a bias 
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chro- 
mosomes. Interesting examples include the 
retrotransposition of a five-exon-containing
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns, represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed intronless
paralogs contains a clear overrepresentation
of genes involved in translational
processes (40% ribosomal proteins and 10%
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could account
for differences in tissue-specific gene
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 





5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is
very similar to a normal gene but that has
been altered slightly so that it is not expressed.
We developed a method for the preliminary
analysis of processed pseudogenes
in the human genome as a starting point in
elucidating the ongoing evolutionary forces

Table 11. Genome overview.

Size of the genome (including gaps)                                   2.91 Gbp
Size of the genome (excluding gaps)                                   2.66 Gbp
Longest contig                                                        1.99 Mbp
Longest scaffold                                                      14.4 Mbp
Percent of A+T in the genome                                          54
Percent of G+C in the genome                                          38
Percent of undetermined bases in the genome                           9
Most GC-rich 50 kb                                                    Chr. 2 (66%)
Least GC-rich 50 kb                                                   Chr. X (25%)
Percent of genome classified as repeats                               35
Number of annotated genes                                             26,383
Percent of annotated genes with unknown function                      42
Number of genes (hypothetical and annotated)                          39,114
Percent of hypothetical and annotated genes with unknown function     59
Gene with the most exons                                              Titin (234 exons)
Average gene size                                                     27 kbp
Most gene-rich chromosome                                             Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                           Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)          605 Mbp
Percent of base pairs spanned by genes                                25.5 to 37.8*
Percent of base pairs spanned by exons                                1.1 to 1.4*
Percent of base pairs spanned by introns                              24.4 to 36.4*
Percent of base pairs in intergenic DNA                               74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons          Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons           Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)    Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                 1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.



Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers 
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated 
in 3-Mb windows for each chromosome. NA, not applicable. 







Chrom.         Male                Sex-average            Female 
          Max.  Avg.  Min.     Max.  Avg.  Min.     Max.  Avg.  Min. 
1         2.60  1.12  0.23     2.81  1.42  0.52     3.39  1.76  0.68 
2         2.23  0.78  0.33     2.65  1.12  0.54     3.17  1.40  0.61 
3         2.55  0.86  0.23     2.40  1.07  0.42     2.71  1.30  0.33 
4         1.66  0.67  0.15     2.06  1.04  0.60     2.50  1.40  0.77 
5         2.00  0.67  0.18     1.87  1.08  0.42     2.26  1.43  0.62 
6         1.97  0.71  0.28     2.57  1.12  0.37     3.47  1.67  0.64 
7         2.34  1.16  0.48     1.67  1.17  0.47     2.27  1.21  0.34 
8         1.83  0.73  0.14     2.40  1.05  0.46     3.44  1.36  0.43 
9         2.01  0.99  0.53     1.95  1.32  0.77     2.63  1.66  0.82 
10        3.73  1.03  0.22     3.05  1.29  0.66     2.84  1.51  0.76 
11        1.43  0.72  0.31     2.13  0.99  0.47     3.10  1.32  0.49 
12        4.12  0.76  0.26     3.35  1.16  0.49     2.93  1.55  0.59 
13        1.60  0.75  0.01     1.87  0.95  0.17     2.49  1.19  0.32 
14        3.15  0.98  0.18     2.65  1.30  0.62     3.14  1.63  0.75 
15        2.28  0.94  0.34     2.31  1.22  0.42     2.53  1.56  0.54 
16        1.83  1.00  0.47     2.70  1.55  0.63     4.99  2.32  1.12 
17        3.87  0.87  0.00     3.54  1.35  0.54     4.19  1.83  0.94 
18        3.12  1.37  0.86     3.75  1.66  0.43     4.35  2.24  0.72 
19        3.02  0.97  0.10     2.57  1.41  0.49     2.89  1.75  0.87 
20        3.64  0.89  0.00     2.79  1.50  0.83     3.31  2.15  1.34 
21        3.23  1.26  0.69     2.37  1.62  1.08     2.58  1.90  1.18 
22        1.25  1.10  0.84     1.88  1.41  1.08     3.73  2.08  0.93 
X         NA    NA    NA       NA    NA    NA       3.12  1.64  0.72 
Y         NA    NA    NA       NA    NA    NA       NA    NA    NA 
Genome    4.12  0.88  0.00     3.75  1.22  0.17     4.99  1.55  0.32 

that account for gene inactivation. The gen- 
eral structural characteristics of these pro- 
cessed pseudogenes include the complete 
lack of intervening sequences found in the 
functional counterparts, a poly(A) tract at the 
3' end, and direct repeats flanking the pseu- 
dogene sequence. Processed pseudogenes oc- 
cur as a result of retrotransposition, whereas 
unprocessed pseudogenes arise from segmen- 
tal genome duplication. 

We searched the complete set of Otto- 
predicted transcripts against the genomic se- 
quence by means of BLAST. Genomic re- 
gions corresponding to all Otto-predicted 
transcripts were excluded from this analysis. 
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 
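
The identity and coverage thresholds used in this search can be sketched as a simple filter. This is an illustration only: the `Hit` record, its field names, and the example values are assumptions made for the sketch, not the output format of BLAST or the procedure actually used for the analysis.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One hypothetical transcript-to-genome match (field names are illustrative)."""
    transcript_id: str
    transcript_length: int     # full length of the query transcript, in bp
    aligned_length: int        # transcript bases covered by the match
    identical_bases: int       # aligned bases identical to the genomic sequence
    overlaps_known_gene: bool  # True if the region corresponds to a predicted transcript


def is_candidate_pseudogene(hit, min_identity=0.70, min_coverage=0.70):
    """Apply the stated thresholds: >70% identity over at least 70% of the transcript."""
    if hit.overlaps_known_gene or hit.aligned_length == 0:
        return False  # regions of predicted transcripts are excluded from the analysis
    identity = hit.identical_bases / hit.aligned_length
    coverage = hit.aligned_length / hit.transcript_length
    return identity > min_identity and coverage >= min_coverage


hits = [
    Hit("t1", 1000, 900, 810, False),  # 90% identity, 90% coverage: kept
    Hit("t2", 1000, 500, 450, False),  # only 50% coverage: rejected
    Hit("t3", 1000, 900, 540, False),  # only 60% identity: rejected
    Hit("t4", 1000, 900, 810, True),   # lies in a predicted transcript region: excluded
]
candidates = [h.transcript_id for h in hits if is_candidate_pseudogene(h)]
```

Because such a filter looks only at similarity and coverage, it does not test for the structural hallmarks listed above (lack of introns, poly(A) tract, flanking direct repeats), which is consistent with the count being described as a probable underestimate.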

We looked for correlations between 
structural elements and the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were com- 
pared between the genes with processed 



pseudogenes (1177 source genes) versus 
the remainder of the predicted gene set. 
Transcripts that give rise to processed pseu- 
dogenes have shorter average transcript 
length (1027 bp versus 1594 bp for the Otto 
set) as compared with genes for which no 
pseudogene was detected. The overall GC 
content did not show any significant differ- 
ence, contrary to a recent report (88). There 
is a clear trend in gene families that are 
present as processed pseudogenes. These 
include ribosomal proteins (67%), lamin 
receptors (10%), translation elongation fac- 
tor alpha (5%), and HMG-non-histone pro- 
teins (2%). The increased occurrence of 
retrotransposition (both intronless paralogs 
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the 
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG 
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8. 

                                                        Chromosome 22         Whole genome (CS assembly) 
                                                      Method 1   Method 2       Method 1   Method 2 
Number of CpG islands detected                           5,211        522        195,706     26,876 
Average length of island (bp)                              390        535            395        497 
Percent of sequence predicted as CpG                       5.9        0.8            2.6        0.4 
Percent of first exons that overlap a CpG island            44         25             42         22 
Percent of first exons with first position of exon 
  contained inside a CpG island                             37         22             40         21 
Average distance between first exon and closest 
  CpG island (bp)                                        1,013     10,486          2,182     17,021 
Expected distance between first exon and closest 
  CpG island (bp)                                        3,262     32,567          7,164     55,811 


Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence. 

Repetitive elements                            Megabases in    Percent of    Previously 
                                               assembled       assembly      predicted 
                                               sequences                     (%) (83) 
Alu                                              288             9.9          10.0 
Mammalian interspersed repeat (MIR)               66             2.3           1.7 
Medium reiteration (MER)                          50             1.7           1.6 
Long terminal repeat (LTR)                       155             5.3           5.6 
Long interspersed nucleotide element (LINE)      466            16.1          16.7 
Total                                           1025            35.3          35.6 



The complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 



5.4 Large-scale duplications 

Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de- 
termined to be in the same family and the 
same complete Lek cluster (essentially 
paralogous genes) (89). Initially, each chro- 
mosome was represented as a string of genes 
ordered by the start codons for predicted 
genes along the chromosome. We considered 
the two strands as a single string, because 
local inversions are relatively common events 
relative to large-scale duplications. Each 
gene was indexed according to the protein 
family and Lek complete cluster (89). All 
pairs of indexed gene strings were then 
aligned in both the forward and reverse di- 
rections with the Smith-Waterman algorithm 
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concat- 
enated end-to-end in order as they occur 
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the 
filtering methods, a shuffled protein set was 
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on 
the shuffled data being used to estimate the 
false-positive rate. The algorithm after filter- 
ing yielded 10,310 gene pairs in 1077 dupli- 
cated blocks containing 3522 distinct genes; 
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli- 
cations is ancient segmental duplications. In 
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
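
Two ideas from this section can be sketched together: the local alignment of family-indexed gene strings used by the first method (match 10, mismatch -10, gap open -4, gap extend -1), and the shuffled gene order used as a control for false positives. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the gap convention (first gapped position costs -4, each further one -1), the toy family labels, and all function names are hypothetical, and in the text the shuffled control was applied to the MUMmer-based method rather than to the Smith-Waterman comparison.

```python
import random

MATCH, MISMATCH = 10, -10      # scores stated in the text
GAP_OPEN, GAP_EXTEND = -4, -1  # assumed: first gapped position -4, each further one -1


def local_alignment_score(a, b):
    """Best Smith-Waterman score between two lists of family/cluster labels."""
    neg = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]    # best local score ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in a
    F = [[neg] * cols for _ in range(rows)]  # alignments ending with a gap in b
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(H[i][j - 1] + GAP_OPEN, E[i][j - 1] + GAP_EXTEND)
            F[i][j] = max(H[i - 1][j] + GAP_OPEN, F[i - 1][j] + GAP_EXTEND)
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            H[i][j] = max(0, H[i - 1][j - 1] + pair, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_block_score(a, b):
    """Align in both the forward and the reverse direction, as described in the text."""
    return max(local_alignment_score(a, b), local_alignment_score(a, b[::-1]))


def shuffled_chromosomes(chromosomes, seed=0):
    """Randomize gene order, then re-partition into chromosomes of the original sizes."""
    genes = [gene for chrom in chromosomes for gene in chrom]
    random.Random(seed).shuffle(genes)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(genes[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_estimate(shuffled_pairs, real_pairs):
    """Fraction of reported gene pairs expected to arise by chance."""
    return shuffled_pairs / real_pairs


# Toy gene strings: three shared families in the same order, one intervening gene.
chr_a = ["F7", "F1", "F2", "F3", "F9"]
chr_b = ["F5", "F1", "F2", "F8", "F3", "F6"]
forward = local_alignment_score(chr_a, chr_b)  # three matches and one gap: 10+10-4+10
control = shuffled_chromosomes([chr_a, chr_b])
rate = false_positive_estimate(370, 10310)     # counts reported in the text
```

With the counts given above, 370 pairs in the shuffled data against 10,310 in the real data, the estimate is about 3.6%, the figure quoted in the text.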

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstruc- 
tions at several evolutionary stages (94). The 
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a 
region containing 97 proteins on chromo- 
some 2 and 332 proteins on chromosome 14. 
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 × 10^-68 (93). This dupli- 
cated set spans 20 Mbp on chromosome 2 and 
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset). This duplication 
contains 64 detected ordered intrachromo- 
somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krap rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 
By this measure, the duplication segment 
spans nearly half of each chromosome's net 
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromo- 
some 20, for a density of involved proteins of 
20 to 30%. This is consistent with an ancient 
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density ob- 
served here. As an independent verification 
of the significance of the alignments detect- 
ed, it can be seen that a substantial number of 
the pairs of aligning proteins in this duplica- 
tion, including some of those annotated (Fig. 
13), are those populating small Lek complete 
clusters (see above). This indicates that they 
are members of very small families of para- 
logs; their relative scarcity within the genome 
validates the uniqueness and robust nature of 
their alignments. 

Two additional qualitative features were ob- 
served among many of the large-scale duplica- 
tions. First, several proteins with disease asso- 
ciations, with OMIM (Online Mendelian Inher- 
itance in Man) assignments, are members of 
duplicated segments (see web table 2 on Sci- 
ence Online at 
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). 
We have also 
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investiga- 
tion needed to determine whether they might be 
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This se- 
lective accretion of noncoding DNA (or con- 
versely, loss of noncoding DNA) on one of a 
pair of duplicated chromosome regions was 
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm, 
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole 
number) of human versus worm and human versus fly proteins per cluster were plotted. 

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the 
blocks detected by this genome-wide analysis. 
The regions of human chromosomes involved 
in the large-scale duplications expanded upon 
above (chromosomes 2 to 14, 2 to 12, and 18 to 
20) are each syntenic to a distinct mouse chro- 
mosomal region. The corresponding mouse 
chromosomal regions are much more similar in 
sequence conservation, and even in order, to 
their human synteny partners than the human 
duplication regions are to each other. Further, 
the corresponding mouse chromosomal regions 
each bear a significant proportion of genes or- 
thologous to the human genes on which the 
human duplication assignments were made. On 
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolu- 
tion, appear to be products of the same large- 
scale duplications observed in humans. Al- 
though further detailed analysis must be carried 
out once a more complete genome is assembled 
for mouse, the underlying large duplications 
appear to predate the two species' divergence. 
This dates the duplications, at the latest, before 
divergence of the primate and rodent lineages. 
This date can be further refined upon examina- 
tion of the synteny between human chromo- 
somes and those of chicken, pufferfish (Fugu 
rubripes), or zebrafish (95). The only sub- 
stantial syntenic stretches mapped in these 
species corresponding to both pairs of human 
duplications are restricted to the Hox cluster 
regions. When the synteny of these regions 
(or others) to human chromosomes is extend- 
ed with further mapping, the ages of the 
nearly chromosome-length duplications seen 
in humans are likely to be dated to the root of 
vertebrate divergence. 

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually re- 



veal the stagewise history of our genome, and 
with it a history of the emergence of many of 
the key functions that distinguish us from other 
living things. 

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used 
to identify single-nucleotide polymorphisms 
(SNPs) by comparison of the Celera sequence 
to other SNP resources. The SNP rate be- 
tween two chromosomes was ~1 per 1200 to 
1500 bp. SNPs are distributed nonrandomly 
throughout the genome. Only a very small 
proportion of all SNPs (<1%) potentially 
impact protein function based on the func- 
tional analysis of SNPs that affect the pre- 
dicted coding regions. This results in an es- 
timate that only thousands, not millions, of 
genetic variations may contribute to the struc- 
tural diversity of human proteins. 

Having a complete genome sequence enables 
researchers to achieve a dramatic acceleration 
in the rate of gene discovery, but only through 
analysis of sequence variation in DNA can we 
discover the genetic basis for variation in health 
among human beings. Whole-genome shotgun 
sequencing is a particularly effective method 
for detecting sequence variation in tandem with 
whole-genome assembly. In addition, we com- 
pared the distribution and attributes of SNPs 
ascertained by three other methods: (i) align- 
ment of the Celera consensus sequence to the 
PFP assembly, (ii) overlap of high-quality reads 
of genomic sequence (referred to as "Kwok"; 
1,120,195 SNPs) (97), and (iii) reduced repre- 
sentation shotgun sequencing (referred to as 
"TSC"; 632,640 SNPs) (98). These data were 
consistent in showing an overall nucleotide di- 
versity of ~8 × 10^-4, marked heterogeneity 
across the genome in SNP density, and an 
overwhelming preponderance of noncoding 
variation that produces no change in expressed 
proteins. 

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full 
use of sequence depth and quality at every site, 
and quantitatively control the rate of false-pos- 
itive and false-negative calls with an explicit 
sampling model (99). Comparison of consensus 
sequences in the absence of these details neces- 
sitated a more ad hoc approach (quality scores 
could not readily be obtained for the PFP as- 
sembly). First, all sequence differences between 
the two consensus sequences were identified; 
these were then filtered to reduce the contribu- 
tion of sequencing errors and misassembly. As 
a measure of the effectiveness of the filtering 
step, we monitored the ratio of transition and 
transversion substitutions, because a 2:1 ratio 
has been well documented as typical in mam- 
malian evolution (100) and in human SNPs 
(101, 102). The filtering steps consisted of re- 
moving variants where the quality score in the 
Celera consensus was less than 30 and where 
the density of variants was greater than 5 in 400 
bp. These filters resulted in shifting the transi- 
tion-to-transversion ratio from 1.57:1 to 
1.89:1. When applied to 2.3 Gbp of alignments 
between the Celera and PFP consensus se- 
quences, these filters resulted in identification 
of 2,104,820 putative SNPs from a total of 
2,778,474 substitution differences. Overlaps 
between this set of SNPs and those found by 
other methods are described below. 
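
The two filters and the transition-to-transversion check described above can be sketched as follows. This is an illustration under assumptions rather than the procedure actually used: the tuple layout `(position, ref, alt, quality)`, the function names, and in particular the reading of "greater than 5 in 400 bp" as a 400-bp window centred on each variant are choices made for the sketch.

```python
from bisect import bisect_left, bisect_right

PURINES = {"A", "G"}


def is_transition(ref, alt):
    """A<->G and C<->T are transitions; every other substitution is a transversion."""
    return (ref in PURINES) == (alt in PURINES)


def filter_variants(variants, min_quality=30, max_per_window=5, window=400):
    """Filter (position, ref, alt, quality) tuples that are sorted by position."""
    positions = [v[0] for v in variants]
    half = window // 2
    kept = []
    for pos, ref, alt, quality in variants:
        if quality < min_quality:
            continue  # consensus quality score below 30
        # Assumed window: the 400 bp centred on the variant, counting the variant itself.
        nearby = bisect_right(positions, pos + half - 1) - bisect_left(positions, pos - half)
        if nearby > max_per_window:
            continue  # more than 5 variants in 400 bp
        kept.append((pos, ref, alt, quality))
    return kept


def transition_transversion_ratio(variants):
    transitions = sum(is_transition(ref, alt) for _, ref, alt, _ in variants)
    transversions = len(variants) - transitions
    return transitions / transversions if transversions else float("inf")


# Hypothetical input: two isolated transitions, one low-quality call,
# a dense cluster of six transversions, and one isolated transversion.
raw = [
    (1000, "A", "G", 40), (5000, "C", "T", 20),
    (10000, "A", "C", 40), (10010, "G", "T", 40), (10020, "C", "G", 40),
    (10030, "A", "T", 40), (10040, "C", "A", 40), (10050, "G", "C", 40),
    (20000, "A", "C", 35), (30000, "C", "T", 50),
]
kept = filter_variants(raw)
ratio_before = transition_transversion_ratio(raw)
ratio_after = transition_transversion_ratio(kept)
```

In this toy input the dense cluster is made up of transversions, so removing it raises the ratio, which mirrors the direction of the shift reported above.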

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from 
dbSNP (www.ncbi.nlm.nih.gov/SNP) and 
13,150 from HGMD (Human Gene Muta- 
tion Database, from the University of 
Wales, UK), were mapped on the Celera con- 
sensus sequence by a sequence similarity 
search with the program PowerBlast (103). The 
two largest data sets in dbSNP are the Kwok 
and TSC sets, with 47% and 25% of the dbSNP 
records. Low-quality alignments with partial 
coverage of the dbSNP sequence and align- 
ments that had less than 98% sequence identity 
between the Celera sequence and the dbSNP 
flanking sequence were eliminated. dbSNP se- 
quences mapping to multiple locations on the 
Celera genome were discarded. A total of 
2,336,935 dbSNP variants were mapped to 
1,223,038 unique locations on the Celera se- 
quence, implying considerable redundancy in 
dbSNP. SNPs in the TSC set mapped to 
585,811 unique genomic locations, and SNPs in 
the Kwok set mapped to 438,032 unique loca- 
tions. The combined unique SNP count used 
in this analysis, including Celera-PFP, TSC, 
and Kwok, is 2,737,668. Table 15 shows that a 
substantial fraction of SNPs identified by one of 
these methods was also found by another meth- 
od. The very high overlap (36.2%) between the 
Kwok and Celera-PFP SNPs may be due in part 
to the use by Kwok of sequences that went into 
the PFP assembly. The unusually low overlap 
(16.4%) between the Kwok and TSC sets is due 



Table 15. Overlap of SNPs from genome-wide 
SNP databases. Table entries are SNP counts for 
each pair of data sets. Numbers in parentheses are 
the fraction of overlap, calculated as the count of 
overlapping SNPs divided by the number of SNPs 
in the smaller of the two databases compared. 
Total SNP counts for the databases are: Celera- 
PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032. 
Only unique SNPs in the TSC and Kwok data sets 
were included. 



to their being the smallest two sets. In addition, 
24.5% of the Celera-PFP SNPs overlap with 
SNPs derived from the Celera genome se- 
quences (46). SNP validation in population 
samples is an expensive and laborious process, 
so confirmation on multiple data sets may pro- 
vide an efficient initial validation "in silico" (by 
computational analysis). 
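
The overlap fractions of Table 15 follow directly from the definition in its caption (overlapping SNPs divided by the size of the smaller data set). The short sketch below recomputes them from the counts given in this section; assigning each count to a pair of data sets is an inference from the percentages quoted in the text, and the variable and function names are illustrative.

```python
# Unique SNP totals stated in the Table 15 caption.
totals = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Pairwise overlap counts; the pairing is inferred from the quoted percentages.
overlaps = {
    ("TSC", "Celera-PFP"): 188_694,
    ("Kwok", "Celera-PFP"): 158_532,
    ("Kwok", "TSC"): 72_024,
}


def overlap_fraction(pair, shared):
    """Shared SNPs divided by the number of SNPs in the smaller of the two sets."""
    first, second = pair
    return shared / min(totals[first], totals[second])


fractions = {pair: round(overlap_fraction(pair, n), 3) for pair, n in overlaps.items()}
```

Because the denominator is always the smaller set, the Kwok set (the smallest) serves as the denominator in two of the three comparisons.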

One means of assessing whether the 
three sets of SNPs provide the same picture 
of human variation is to tally the frequen- 
cies of the six possible base changes in 
each set of SNPs (Table 16). Previous mea- 
sures of nucleotide diversity were mostly 
derived from small-scale analysis on can- 
didate genes (101), and our analysis with 
all three data sets validates the previous 
observations at the whole-genome scale. 
There is remarkable homogeneity between 
the SNPs found in the Kwok set, the TSC 
set, and in our whole-genome shotgun (46) 
in this substitution pattern. Compared with 
the rest of the data sets, Celera-PFP devi- 
ates slightly from the 2:1 transition-to- 
transversion ratio observed in the other 
SNP sets. This result is not unexpected, 
because some fraction of the computation- 
ally identified SNPs in the Celera-PFP 
comparison may in fact be sequence errors. 
A 2:1 transition:transversion ratio for the 
bona fide SNPs would be obtained if one 
assumed that 15% of the sequence differ- 
ences in the Celera-PFP set were a result of 
(presumably random) sequence errors. 
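
The arithmetic behind the 15% figure can be sketched as a two-component mixture. The model below is an assumption made for illustration: bona fide SNPs are taken to have a 2:1 ratio, and a random sequence error is taken to replace a base with any of the three alternatives with equal probability, so that one error in three is a transition (a 1:2 ratio). The function names are hypothetical.

```python
def transition_share(ratio):
    """Convert a transition:transversion ratio into the fraction of transitions."""
    return ratio / (1.0 + ratio)


def observed_ratio(error_fraction, true_ratio=2.0, error_ratio=0.5):
    """Ratio expected when a fraction of the differences are random errors."""
    share = ((1.0 - error_fraction) * transition_share(true_ratio)
             + error_fraction * transition_share(error_ratio))
    return share / (1.0 - share)


def implied_error_fraction(observed, true_ratio=2.0, error_ratio=0.5):
    """Invert the mixture: which error fraction reproduces the observed ratio?"""
    true_share = transition_share(true_ratio)
    error_share = transition_share(error_ratio)
    return (true_share - transition_share(observed)) / (true_share - error_share)


ratio_with_15_percent_errors = observed_ratio(0.15)
errors_implied_by_1_59 = implied_error_fraction(1.59)
```

Under these assumptions, 15% random errors lower a true 2:1 ratio to about 1.61:1, and the 1.59:1 ratio listed among the Table 16 entries corresponds to an error fraction of about 16%, in line with the estimate in the text.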

6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to 
normalize these values to the chromosome 
size and sequence coverage, we used π, the 
standard statistic for nucleotide diversity 
(104). Nucleotide diversity is a measure of 
per-site heterozygosity, quantifying the 
probability that a pair of chromosomes 
drawn from the population will differ at a 
nucleotide site. In order to calculate nucle- 
otide diversity for each chromosome, we 
need to know the number of nucleotide 
sites that were surveyed for variation, and 
in methods like reduced representation se- 
quencing, we need to know the sequence 
quality and the depth of coverage at each 
site. These data are not readily available, so 
we could not estimate nucleotide diversity 
from the TSC effort. Estimation of nucleo- 
tide diversity from high-quality sequence 
overlaps should be possible, but again, 
more information is needed on the details 
of all the alignments. 

Estimation of nucleotide diversity from a 
shotgun assembly entails calculating, for each 
column of the multialignment, the probability 
that two or more distinct alleles are present, 
and the probability of detecting a SNP if in 
fact the alleles have different sequence (i.e., 
the probability of correct sequence calls). The 
greater the depth of coverage and the higher 
the sequence quality, the higher is the chance 
of successfully detecting a SNP (105). Even 
after correcting for variation in coverage, the 
nucleotide diversity appeared to vary across 
autosomes. The significance of this heteroge- 
neity was tested by analysis of variance, with 
estimates of π for 100-kbp windows to esti- 
mate variability within chromosomes (for the 
Celera-PFP comparison, F = 29.73, P < 
0.0001). 
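
The analysis of variance mentioned here compares the spread of window-level diversity estimates between chromosomes with the spread within chromosomes. A minimal sketch of the one-way F statistic follows; the per-window values are invented for illustration, the function name is hypothetical, and the P value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square divided by within-group mean square."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical diversity estimates (x 10^-4) for 100-kbp windows on three chromosomes.
windows_by_chromosome = {
    "chrA": [7.9, 8.4, 8.1, 8.6],
    "chrB": [9.3, 9.8, 9.1, 9.6],
    "chrC": [6.2, 6.9, 6.5, 6.4],
}
f_statistic = one_way_anova_f(list(windows_by_chromosome.values()))
```

A large F value indicates that chromosomes differ from one another by more than the window-to-window variation within a chromosome would predict.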

Average diversity for the autosomes es- 
timated from the Celera-PFP comparison 
was 8.94 × 10^-4. Nucleotide diversity on 
the X chromosome was 6.54 × 10^-4. The 
X is expected to be less variable than au- 
tosomes, because for every four copies of 
autosomes in the population, there are only 
three X chromosomes, and this smaller ef- 
fective population size means that random 
drift will more rapidly remove variation 
from the X (106). 
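
The expectation stated above can be checked against the two estimates with a line of arithmetic. The sketch assumes the simplest neutral model, in which diversity scales with the number of chromosome copies in the population, with equal numbers of males and females and equal mutation rates on the X and the autosomes; the names are illustrative.

```python
def expected_x_to_autosome_ratio(x_copies=3, autosome_copies=4):
    """Three X chromosomes exist for every four copies of an autosome."""
    return x_copies / autosome_copies


autosome_diversity = 8.94e-4  # Celera-PFP estimate for the autosomes
x_diversity = 6.54e-4         # Celera-PFP estimate for the X chromosome

expected = expected_x_to_autosome_ratio()
observed = x_diversity / autosome_diversity
```

The observed ratio of about 0.73 is close to the 0.75 expected from the difference in copy number alone.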

Having ascertained nucleotide variation 
genome-wide, it appears that previous esti- 
mates of nucleotide diversity in humans 
based on samples of genes were reasonably 
accurate (101, 102, 106, 107). Genome-wide, 
our estimate of nucleotide diversity was 
8.98 × 10^-4 for the Celera-PFP alignment, 
and a published estimate averaged over 10 
densely resequenced human genes was 
8.00 × 10^-4 (108). 



6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variabil- 
ity among chromosomes in SNP density 
raises the question of whether there is het- 
erogeneity at a finer scale within chromo- 



Table 15 entries (overlapping SNP count, with the fraction of the smaller data set in parentheses): 

TSC and Celera-PFP     188,694 (0.322) 
Kwok and Celera-PFP    158,532 (0.362) 
Kwok and TSC            72,024 (0.164) 

Table 16. Summary of nucleotide changes in different SNP data sets. 

Only part of this table is recoverable. 
Row label: TSC† 
Column headings: SNP data set, A/C (%), C/T, C/G, Transition:transversion 
Percentage entries: 9.2, 8.6, 8.6; 10.3, 8.4 
Transition:transversion ratios: 1.59:1, 2.07:1, 1.99:1 
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Rg. 13. Segmental dupO_ 
tions between chromo- 
somes in the human ge- 
nome. The 24 panels show 
the 1077 duplicated blocks 
of genes, containing 10310 
pairs of genes in total Each 
line represents a pair of ho- 
mologous genes belonging 
to a block; all blocks con- 
tain at least three genes 
on each of the chromo- 
somes where they appear. 
Each panel shows all the 
duplications between a 
single chromosome and 
other chromosomes with 
shared blocks. The chro- 
mosome at the center of 
each panel is shown as a 
thick red line for emphasis. 
Other chromosomes are 
displayed from top to bot- 
. torn within each panel or- 
dered by chromosome 
number. The inset (bot- 
tom, center right) shows a 
dose-up of one duplica- 
tion between chromo- 
somes 18 and 20, expand- 
ed to display the gene 
names of 12 of the 64 
gene pairs showa 
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somes, and whether this heterogeneity is 
greater than expected by chance. If SNPs 
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of 
the genome. Population genetics theory holds 
that we can account for this variation with a 
mathematical formulation called the neutral 
coalescent (109). Applying well-tested algo- 
rithms for simulating the neutral coalescent 
with recombination (110% and using an ef- 
fective population size of 10,000 and a per- 
base recombination rate equal to the mutation 
rate (111), we generated a distribution of num- 
bers of SNPs by this model as well (112). The 
observed distribution of SNPs has a much larg- * 
er variance than either the Poisson model or the 
coalescent model, and the difference is highly 
significant This implies that there is significant 
variability across the genome in SNP density, 
an observation that begs an explanation. 
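
The comparison with a Poisson expectation can be illustrated with the variance-to-mean ratio (index of dispersion) of per-window SNP counts, which is about 1 for Poisson counts and larger when the underlying rate varies between windows. The following is a minimal sketch on simulated data, not the authors' procedure: the window count, the mean of 20 SNPs per window, and the gamma-distributed rate variation are all hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson-distributed counts."""
    return statistics.variance(counts) / statistics.mean(counts)


def poisson_sample(lam, rng):
    """Draw one Poisson variate with mean lam (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(0)
n_windows = 5000    # hypothetical number of fixed-size windows
mean_snps = 20.0    # hypothetical mean SNP count per window

# Homogeneous model: the same SNP rate in every window.
uniform = [poisson_sample(mean_snps, rng) for _ in range(n_windows)]

# Heterogeneous model: the rate itself varies between windows
# (gamma-distributed with the same mean; an assumption for illustration).
varying = [
    poisson_sample(rng.gammavariate(4.0, mean_snps / 4.0), rng)
    for _ in range(n_windows)
]

d_uniform = dispersion_index(uniform)
d_varying = dispersion_index(varying)
print(f"homogeneous rate:   dispersion index = {d_uniform:.2f}")
print(f"heterogeneous rate: dispersion index = {d_varying:.2f}")
```

A dispersion index well above 1, as in the second simulated case, is the kind of excess variance that the text reports for the observed SNP counts relative to the Poisson model.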

Several attributes of the DNA sequence may affect the local density of SNPs, including the
rate at which DNA polymerase makes errors and the efficacy of mismatch repair. One key
factor that is likely to be associated with SNP density is the G+C content, in part because
methylated cytosines in CpG dinucleotides tend to undergo deamination to form thymine,
accounting for a nearly 10-fold increase in the mutation rate of CpGs over other
dinucleotides. We tallied the G+C content and nucleotide diversities in 100-kbp windows
across the entire genome and found that the correlation between them was positive (r =
0.21) and highly significant (P < 0.0001), but G+C content accounted for only a
small part of the variation.
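
One way to quantify "a small part of the variation" is the squared correlation coefficient, which for a simple linear relationship gives the fraction of variance in one quantity explained by the other. The sketch below assumes that interpretation; only the value of r comes from the text.

```python
# Fraction of variance explained by a linear relationship with correlation r.
r = 0.21  # correlation between G+C content and nucleotide diversity (from the text)
variance_explained = r ** 2
print(f"variance explained: {variance_explained:.1%}")
```

On this reading, G+C content would account for roughly one-twentieth of the window-to-window variation in diversity.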

6.5 SNPs by genomic class 

To test homogeneity of SNP densities across functional classes, we partitioned sites into
intergenic (defined as >5 kbp from any predicted transcription unit), 5'-UTR, exonic
(missense and silent), intronic, and 3'-UTR for 10,239 known genes, derived from the NCBI
RefSeq database and all human genes predicted from the Celera Otto annotation. In coding
regions, SNPs were categorized as either silent, for those that do not change amino acid
sequence, or missense, for those that change the protein product. The ratio of missense to
silent coding SNPs in Celera-PFP, TSC, and Kwok sets (1.12, 0.91, and 0.78, respectively)
shows a markedly reduced frequency of missense variants compared with the neutral
expectation, consistent with the elimination by natural selection of a fraction of the
deleterious amino acid changes (112). These ratios are comparable to the missense-to-silent
ratios of 0.88 and 1.17 found by Cargill et al. (101) and by Halushka et al. (102). Similar
results were observed in SNPs derived from Celera shotgun sequences (46).

It is striking how small is the fraction of SNPs that lead to potentially dysfunctional
alterations in proteins. In the 10,239 RefSeq genes, missense SNPs were only about 0.12,
0.14, and 0.17% of the total SNP counts in Celera-PFP, TSC, and Kwok SNPs, respectively.
Nonconservative protein changes constitute an even smaller fraction of missense SNPs (47,
41, and 40% in Celera-PFP, Kwok, and TSC). Intergenic regions have been virtually unstudied
(113), and we note that 75% of the SNPs we identified were intergenic (Table 17). The SNP
rate was highest in introns and lowest in exons. The SNP rate was lower in intergenic
regions than in introns, providing one of the first discriminators between these two
classes of DNA. These SNP rates were confirmed in the Celera SNPs, which also exhibited a
lower rate in exons than in introns, and in extragenic regions than in introns (46). Many
of these intergenic SNPs will provide valuable information in the form of markers for
linkage and association studies, and some fraction is likely to have a regulatory function
as well.

Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history.

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial computational analysis of the predicted protein
set with the aim of cataloging prominent differences and similarities when the human genome
is compared with other fully sequenced eukaryotic genomes. Over 40% of the predicted
protein set in humans cannot be ascribed a molecular function by methods that assign
proteins to known families. A protein domain-based analysis provides a detailed catalog of
the prominent differences in the human genome when compared with the fly and worm genomes.
Prominent among these are domain expansions in proteins involved in developmental
regulation and in cellular processes such as neuronal function, hemostasis, acquired immune
response, and cytoskeletal complexity. The final enumeration of protein families and
details of protein structure will rely on additional experimental work and comprehensive
manual curation.

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary and are subject to several limitations. Both the
gene predictions and functional assignments have been made by using computational tools,
although the statistical models in Panther, Pfam, and SMART have been built, annotated, and
reviewed by expert biologists. In the set of computationally predicted genes, we expect
both false-positive predictions (some of these may in fact be inactive pseudogenes) and
false-negative predictions (some human genes will not be computationally predicted). We
also expect errors in delimiting the boundaries of exons and genes. Similarly, in the
automatic functional assignments, we also expect both false-positive and false-negative
predictions. The functional assignment protocol focuses on protein families that tend to be
found across several organisms, or on families of known human genes. Therefore, we do not
assign a function to many genes that are not in large families, even if the function is
known. Unless otherwise specified, all enumeration of the genes in any given family or
functional category was taken from the set of 26,588 predicted proteins, which were
assigned functions by using statistical score cutoffs defined for models in Panther, Pfam,
and SMART.

For this initial examination of the predicted human protein set, three broad questions were
asked: (i) What are the likely molecular functions of the predicted gene products, and how
are these proteins categorized with current classification methods? (ii) What are the core
functions that appear to be common across the animals? (iii) How does the human protein
complement differ from that of other sequenced eukaryotes?

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the putative molecular functions of the predicted 26,588
human proteins that have at least two lines of supporting evidence. About 41% (12,809) of
the gene products could not be classified from this initial analysis and are termed
proteins with unknown functions. Because our automatic classification methods treat only
relatively large protein families, there are a number of "unclassified" sequences that do,
in fact, have a known or predicted function. For the 60% of the protein set that have
automatic functional predictions, the specific protein functions have been placed into
broad classes. We focus here on molecular function (rather than higher order cellular
processes) in order to classify as many proteins as possible. These functional predictions
are based on similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted genes (those with only
one piece of supporting evidence), only 636 (5%) of these additional putative genes were
assigned molecular functions by the automated methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins, further suggesting that the majority of
these unknown-function genes are not real genes. Given that most of these additional 12,095
genes appear to be unique among the genomes sequenced to date, many may simply represent
false-positive gene predictions.

The most common molecular functions are the transcription factors and those involved in
nucleic acid metabolism (nucleic acid enzyme). Other functions that are highly represented
in the human genome are the receptors, kinases, and hydrolases. Not surprisingly, most of
the hydrolases are proteases. There are also many proteins that are members of
proto-oncogene families, as well as families of "select regulatory molecules": (i) proteins
involved in specific steps of signal transduction such as heterotrimeric GTP-binding
proteins (G proteins) and cell cycle regulators, and (ii) proteins that modulate the
activity of kinases, G proteins, and phosphatases.

Table 17. Distribution of SNPs in classes of 
genomic regions. 



Columns: Genomic region class; Size of region examined (Mb); Celera-PFP SNP density (SNP/Mb).

[Table body largely lost in extraction. Surviving fragments, in extraction order: 2185;
Gene (intron + ...; First intron; Exon, 31; First exon, 10.]
Fig. 15. Distribution of the molecular functions of 26,383 human genes. Each slice lists
the numbers and percentages (in parentheses) of human gene functions assigned to a given
category of molecular function. The outer circle shows the assignment to molecular function
categories in the Gene Ontology (GO) (179), and the inner circle shows the assignment to
Celera's Panther molecular function categories (116). [Slice labels surviving extraction:
viral protein (100, 0.3%); transfer/carrier protein (203, 0.7%).]
7.2 Evolutionary conservation of core
processes

Because of the various "model organism" genome-sequencing projects that have already been
completed, reasonable comparative information is available for beginning the analysis of
the evolution of the human genome. The genomes of S. cerevisiae ("bakers' yeast") (118) and
two diverse invertebrates, C. elegans (a nematode worm) (119) and D. melanogaster (fly)
(26), as well as the first plant genome, A. thaliana, recently completed (92), provide a
diverse background for genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly, and between human and
worm (Fig. 16) to address the question, What are the core functions that appear to be
common across the animals? The concept of orthology is important because if two genes are
orthologs, they can be traced by descent to the common ancestor of the two organisms (an
"evolutionarily conserved protein set"), and therefore are likely to perform similar
conserved functions in the different organisms. It is critical in this analysis to separate
orthologs (a gene that appears in two organisms by descent from a common ancestor) from
paralogs (a gene that appears in more than one copy in a given organism by a duplication
event) because paralogs may subsequently diverge in function. Following the yeast-worm
ortholog comparison in (120), we identified two different cases for each pairwise
comparison (human-fly and human-worm). The first case was a pair of genes, one from each
organism, for which there was no other close homolog in either organism. These are
straightforwardly identified as orthologous, because there are no additional members of the
families that complicate separating orthologs from paralogs. The second case is a family of
genes with more than one member in either or both of the organisms being compared. Chervitz
et al. (120) dealt with this case by analyzing a phylogenetic tree that described the
relationships between all of the sequences in both organisms, and then looking for pairs of
genes that were nearest neighbors in the tree. If the nearest-neighbor pairs were from
different organisms, those genes were presumed to be orthologs. We note that these nearest
neighbors can often be confidently identified from pairwise sequence comparison without
having to examine a phylogenetic tree (see legend to Fig. 16). If the nearest neighbors are
not from different organisms, there has been a paralogous expansion in one or both
organisms after the speciation event (and/or a gene loss by one organism). When this
one-to-one correspondence is lost, defining an ortholog becomes ambiguous. For our initial
computational overview of the predicted human protein set, we could not answer this
question for every predicted protein. Therefore, we consider only "strict orthologs," i.e.,
the proteins with unambiguous one-to-one relationships (Fig. 16). By these criteria, there
are 2758 strict human-fly orthologs and 2031 strict human-worm orthologs (1523 in common
between these sets). We define the evolutionarily conserved set as those 1523 human
proteins that have strict orthologs in both D. melanogaster and C. elegans.
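
The pairwise identification of one-to-one orthologs described above can be sketched as a reciprocal-best-hit search. This is a simplified illustration rather than the authors' pipeline: the function names, the tuple layout, and the toy identifiers (hA, fA, wA, and so on) are hypothetical, the 1e-10 cutoff is the P-value threshold given in the legend to Fig. 16, and the additional check against within-species paralogs described in that legend is omitted.

```python
def best_hits(hits, max_pvalue):
    """Map each query to its single most significant subject.

    hits: iterable of (query, subject, pvalue) tuples from one search direction.
    Hits with pvalue >= max_pvalue are ignored.
    """
    best = {}
    for query, subject, pvalue in hits:
        if pvalue >= max_pvalue:
            continue
        if query not in best or pvalue < best[query][1]:
            best[query] = (subject, pvalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_pvalue=1e-10):
    """Return (a, b) pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b, max_pvalue)
    backward = best_hits(b_vs_a, max_pvalue)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Toy search results (hypothetical identifiers and P-values).
human_vs_fly = [("hA", "fA", 1e-50), ("hA", "fB", 1e-20),
                ("hB", "fB", 1e-30), ("hC", "fC", 1e-5)]
fly_vs_human = [("fA", "hA", 1e-48), ("fB", "hA", 1e-25),
                ("fB", "hB", 1e-28), ("fC", "hC", 1e-6)]
human_vs_worm = [("hA", "wA", 1e-40)]
worm_vs_human = [("wA", "hA", 1e-42)]

human_fly = reciprocal_best_hits(human_vs_fly, fly_vs_human)
human_worm = reciprocal_best_hits(human_vs_worm, worm_vs_human)

# Conserved set: human proteins with a strict ortholog in both other species.
conserved = {h for h, _ in human_fly} & {h for h, _ in human_worm}
print(sorted(human_fly))
print(sorted(conserved))
```

In the toy data, hC and fC are each other's best hits but fall short of the significance cutoff, so they are excluded, which mirrors why a strict definition yields a lower bound on the number of orthologs.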

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate genomes. Each
slice lists the number and percentages (in parentheses) of "strict orthologs" between the
human, fly, and worm genomes involved in a given category of molecular function. "Strict
orthologs" are defined here as bi-directional BLAST best hits (180) such that each
orthologous pair (i) has a BLASTP P-value of <10^-10 (120), and (ii) has a more significant
BLASTP score than any paralogs in either organism, i.e., there has likely been no
duplication subsequent to speciation that might make the orthology ambiguous. This measure
is quite strict and is a lower bound on the number of orthologs. By these criteria, there
are 2758 strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common between
these sets).

The distribution of the functions of the conserved protein set is shown in Fig. 16.
Comparison with Fig. 15 shows that, not surprisingly, the set of conserved proteins is not
distributed among molecular functions in the same way as the whole human protein set.
Compared with the whole human set (Fig. 15), there are several categories that are
overrepresented in the conserved set by a factor of ~2 or more. The first category is
nucleic acid enzymes, primarily the transcriptional machinery (notably DNA/RNA
methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing
factors, nucleases, and ribosomal proteins). The basic transcriptional and translational
machinery is well known to have been conserved over evolution, from bacteria through to the
most complex eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to be
conserved among the animals. Other enzyme types are also overrepresented (transferases,
oxidoreductases, ligases, lyases, and isomerases). Many of these enzymes are involved in
intermediary metabolism. The only exception is the hydrolase category, which is not
significantly overrepresented in the shared protein set. Proteases form the largest part of
this category, and several large protease families have expanded in each of these three
organisms after their divergence. The category of select regulatory molecules is also
overrepresented in the conserved set. The major conserved families are small guanosine
triphosphatases (GTPases) (especially the Ras-related superfamily, including ADP
ribosylation factor) and cell cycle regulators (particularly the cullin family, cyclin C
family, and several cell division protein kinases). The last two significantly
overrepresented categories are protein transport and trafficking, and chaperones. The most
conserved groups in these categories are proteins involved in coated vesicle-mediated
transport, and chaperones involved in protein folding and heat-shock response [particularly
the DNAJ family, and heat-shock protein 60 (HSP60), HSP70, and HSP90 families]. These
observations provide only a conservative estimate of the protein families in the context of
specific cellular processes that were likely derived from the last common ancestor of the
human, fly, and worm. As stated before, this analysis does not provide a complete estimate
of conservation across the three animal genomes, as paralogous duplication makes the
determination of true orthologs difficult within the members of conserved protein families.
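
Overrepresentation "by a factor of ~2 or more" can be read as a fold enrichment: the share of a category within the conserved set divided by its share within the whole protein set. The sketch below assumes that reading. The set sizes (1523 and 26,588) come from the text, while the per-category counts are hypothetical placeholders, since the actual slice counts appear only in Figs. 15 and 16.

```python
def fold_enrichment(n_cat_conserved, n_conserved, n_cat_all, n_all):
    """Ratio of a category's share in the conserved set to its share overall."""
    return (n_cat_conserved / n_conserved) / (n_cat_all / n_all)


N_CONSERVED = 1523   # conserved human proteins (from the text)
N_ALL = 26588        # predicted human proteins (from the text)

# Hypothetical category counts, for illustration only.
example = fold_enrichment(120, N_CONSERVED, 1000, N_ALL)
print(f"fold enrichment: {example:.2f}")
```

A category with these placeholder counts would sit just above the twofold threshold mentioned in the text.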
7.3 Differences between the human
genome and other sequenced
eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we have compared the
human genome with the other sequenced eukaryotic genomes at three levels: molecular
functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to begin to reveal the
developmental and cellular processes that are unique to the vertebrates. Tables 18 and 19
display a comparison among all sequenced eukaryotic genomes, over selected protein/domain
families (defined by sequence similarity, e.g., the serine-threonine protein kinases) and
superfamilies (defined by shared molecular function, which may include several
sequence-related families, e.g., the cytokines). In these tables we have focused on
(super) families that are either very large or that differ significantly in humans compared
with the other sequenced eukaryote genomes. We have found that the most prominent human
expansions are in proteins involved in (i) acquired immune functions; (ii) neural
development, structure, and functions; (iii) intercellular and intracellular signaling
pathways in development and homeostasis; (iv) hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human genome and the
Drosophila or C. elegans genome is the appearance of genes involved in acquired immunity
(Tables 18 and 19). This is expected, because the acquired immune response is a defense
system that only occurs in vertebrates. We observe 22 class I and 22 class II major
histocompatibility complex (MHC) antigen genes and 114 other immunoglobulin genes in the
human genome. In addition, there are 59 genes in the cognate immunoglobulin receptor
family. At the domain level, this is exemplified by an expansion and recruitment of the
ancient immunoglobulin fold to constitute molecules such as MHC, and of the integrin fold
to form several of the cell adhesion molecules that mediate interactions between immune
effector cells and the extracellular matrix. Vertebrate-specific proteins include the
paracrine immune regulators family of secreted 4-alpha helical bundle proteins, namely the
cytokines and chemokines. Some of the cytoplasmic signal transduction components associated
with cytokine receptor signal transduction are also features that are poorly represented in
the fly and worm. These include protein domains found in the signal transducer and
activator of transcription (STATs), the suppressors of cytokine signaling (SOCS), and
protein inhibitors of activated STATs (PIAS). In contrast, many of the animal-specific
protein domains that play a role in innate immune response, such as the Toll receptors, do
not appear to be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as compared with the worm
and fly genomes, there is a marked increase in the number of members of protein families
that are involved in neural development. Examples include neurotrophic factors such as
ependymin, nerve growth factor, and signaling molecules such as semaphorins, as well as the
number of proteins involved directly in neural structure and function such as myelin
proteins, voltage-gated ion channels, and synaptic proteins such as synaptotagmin. These
observations correlate well with the known phenotypic differences between the nervous
systems of these taxa, notably (i) the increase in the number and connectivity of neurons;
(ii) the increase in number of distinct neural cell types (as many as a thousand or more in
human compared with a few hundred in fly and worm) (121); (iii) the increased length of
individual axons; and (iv) the significant increase in glial cell number, especially the
appearance of myelinating glial cells, which are electrically inert supporting cells
differentiated from the same stem cells as neurons. A number of prominent protein
expansions are involved in the processes of neural development. Of the extracellular
domains that mediate cell adhesion, the connexin domain-containing proteins (122) exist
only in humans. These proteins, which are not present in the Drosophila or C. elegans
genomes, appear to provide the constitutive subunits of intercellular channels and the
structural basis for electrical coupling. Pathway finding by axons and neuronal network
formation is mediated through a subset of ephrins and their cognate receptor tyrosine
kinases that act as positional labels to establish topographical projections (123). The
probable biological role for the semaphorins (22 in human compared with 6 in the fly and 2
in the worm) and their receptors (neuropilins and plexins) is that of axonal guidance
molecules (124). Signaling molecules such as neurotrophic factors and some cytokines have
been shown to regulate neuronal cell survival, proliferation, and axon guidance (125).
Notch receptors and ligands play important roles in glial cell fate determination and
gliogenesis (126).

Other human expanded gene families play key roles directly in neural structure and
function. One example is synaptotagmin (expanded more than twofold in humans relative to
the invertebrates), originally found to regulate synaptic transmission by serving as a
Ca2+ sensor (or receptor) during synaptic vesicle fusion and release (127). Of interest is
the increased co-occurrence in humans of PDZ and the SH3 domains in neuronal-specific
adaptor molecules; examples include proteins that likely modulate channel activity at
synaptic junctions (128). We also noted expansions in several ion-channel families (Table
19), including the EAG subfamily (related to cyclic nucleotide gated channels), the
voltage-gated calcium/sodium channel family, the inward-rectifier potassium channel family,
and the voltage-gated potassium channel, alpha subunit family. Voltage-gated sodium and
potassium channels are involved in the generation of action potentials in neurons. Together
with voltage-gated calcium channels, they also play a key role in coupling action
potentials to neurotransmitter release, in the development of neurites, and in short-term
memory. The recent observation of a calcium-regulated association between sodium channels
and synaptotagmin may have consequences for the establishment and regulation of neuronal
excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes of protein
components in both the central and peripheral nervous system of vertebrates. Myelin P0 is a
major component of peripheral myelin, and myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central nervous system. Mutations in any of these
Table 18. Domain-based comparative analysis of proteins in H. sapiens (H),
D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A). The
predicted protein set of each of the above eukaryotic organisms was analyzed
with Pfam version 5.5 using E value cutoffs of 0.001. The number of proteins
containing the specified Pfam domains as well as the total number of domains
(in parentheses) are shown in each column. Domains were categorized into
cellular processes for presentation. Some domains (i.e., SH2) are listed in
more than one cellular process. Results of the Pfam analysis may differ from
results obtained based on human curation of protein families, owing to the
limitations of large-scale automatic classifications. Representative examples
of domains with reduced counts owing to the stringent E value cutoff used for
this analysis are marked with a double asterisk (**). Examples include short
divergent and predominantly alpha-helical domains, and certain classes of
cysteine-rich zinc finger proteins.
Accession Domain name       H          F         W         Y     A         Domain description

PF00594   Gla               11         0         0         0     0         Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain

Immune response
PF00711   Defensin_beta     1          0         0         0     0         Beta defensin
PF00748   Calpain_inhib     3(9)       0         0         0     0         Calpain inhibitor repeat
PF00666   Cathelicidins     2          0         0         0     0         Cathelicidins
PF00129   MHC_I             18(20)     0         0         0     0         Class I histocompatibility antigen, domains alpha 1 and 2
PF00993   MHC_II_alpha**    5(6)       0         0         0     0         Class II histocompatibility antigen, alpha domain
PF00969   MHC_II_beta**     7          0         0         0     0         Class II histocompatibility antigen, beta domain
PF00879   Defensin_propep   3          0         0         0     0         Defensin propeptide
PF01109   GMCSF             1          0         0         0     0         Granulocyte-macrophage colony-stimulating factor
                            381(930)   125(291)  67(323)   0     0         Immunoglobulin domain
PF00143   Interferon        7(9)       0         0         0     0         Interferon alpha/beta domain
                            1          0         0         0     0         Interferon gamma
                            1          0         0         0     0         Interleukin-10
                            1          0         0         0     0         Interleukin-15
                            1          0         0         0     0         Interleukin-2
                            1          0         0         0     0         Interleukin-4
                            1          0         0         0     0         Interleukin-5
                            1          0         0         0     0         Interleukin-7/9 family
                            7          0         0         0     0         Interleukin-1
                            1          0         0         0     0         Interleukin-1 propeptide
                            1          0         0         0     0         Interleukin-3
                            2          0         0         0     0         Interleukin-6/G-CSF/MGF family
                            2          0         0         0     0         Leukemia inhibitory factor (LIF)/oncostatin (OSM) family
                            2          0         0         0     0         Mammalian defensin
                            2          0         0         0     0         PTN/MK heparin-binding protein
                            4          0         0         0     0         Serum amyloid A protein
                            32         0         0         0     0         Small cytokines (intecrine/chemokine), interleukin-8 like
                            18         8         2         0     131(143)  TIR domain
                            12         0         0         0     0         TNF (tumor necrosis factor) family
                            5(6)       0         2         0     0         Trefoil (P-type) domain

PI-PY-rho GTPase signaling
                            5          1         0         0     0         BTK motif
                            73(101)    32(44)    24(35)    6(9)  66(90)    C2 domain
                            9          4         7         0     6         Diacylglycerol kinase accessory domain (presumed)
                            10         8         8         2     11(12)    Diacylglycerol kinase catalytic domain (presumed)
                            12(13)     4         10        5     2         Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP)
                            28(30)     14        15        5     15        FYVE zinc finger
                            6          2         1         1     3         GDP dissociation inhibitor
                            27(30)     10        20(23)    2     5         G-protein alpha subunit
                            16         5         5         1     0         G-protein gamma like domains
          RasGAP            11         5         8         3     0         GTPase-activator protein for Ras-like GTPase
                            9          2         3         5     0         Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif
                            12         8         7         1     4         Guanylate kinase
                            3          0         0         0     0         Immunoreceptor tyrosine-based activation motif
PF00169   PH                193(212)   72(78)    65(68)    24    23        PH domain
PF00130   DAG_PE-bind       45(56)     25(31)    26(40)    1(2)  4         Phorbol esters/diacylglycerol binding domain (C1 domain)
PF00388   PI-PLC-X          12         3         7         1     8         Phosphatidylinositol-specific phospholipase C, X domain
PF00387   PI-PLC-Y          11         2         7         1     8         Phosphatidylinositol-specific phospholipase C, Y domain
PF00640   PID               24(27)     13        11(12)    0     0         Phosphotyrosine interaction domain (PTB/PID)
PF02192   PI3K_p85B         2          1         1         0     0         PI3-kinase family, p85-binding domain
PF00794   PI3K_rbd          6          3         1         0     0         PI3-kinase family, ras-binding domain
PF01412   ArfGAP            16         9         8         6     15        Putative GTP-ase activating protein for Arf
PF02196                     6(7)       4         1         0     0         Raf-like Ras-binding domain
PF02145   Rap_GAP           5          4         2         0     0         Rap/ran-GAP
PF00788   RA                18(19)     7(9)      6         1     0         Ras association (RalGDS/AF-6) domain
PF00071   Ras               126        56(57)    51        23    78        Ras family
PF00617   RasGEF            21         8         7         5     0         RasGEF domain
PF00615   RGS               27         6(7)      12(13)    1     0         Regulator of G protein signaling domain
PF02197   RIIa              4          1         2         1     0         Regulatory subunit of type II PKA R-subunit


www.sciencemag.org SCIENCE VOL 291 16 FEBRUARY 2001



1339 



The Human genome 



Table 18 (Continued)



Accession 
number 

PF00620 
PF00621 
PF00536 
PF01369 
PF00017 
PF00018 
PF01017 
PF00790 
PF00568 

PF00452 
PF02180 
PF00619 
PF00531 
PF01335 
PF02179 
PF00656 
PF00653 

PF00022 
PF00191 
PF00402 
PF00373 
PF00880 
PF00681 
PF00435 
PF00418 
PF00992 
PF02209 
PF01044 



Domain name 



Domain description 



H 



RhoGAP
RhoGEF
SAM
Sec7
SH2
SH3
STAT
VHS
WH1
Bcl-2
BH4
CARD
Death
DED
BAG
ICE_p20
BIR
Actin
Annexin
Calponin
Band_41
Nebulin_repeat
Plectin_repeat
Spectrin
Tubulin-binding
Troponin
VHP
Vinculin



PF01391  Collagen
PF01413  C4
PF00431  CUB
PF00008  EGF
PF00147  Fibrinogen_C
PF00041  Fn3
PF00757  Furin-like
PF00357  Integrin_A
PF00362  Integrin_B
PF00052  Laminin_B
PF00053  Laminin_EGF
PF00054  Laminin_G
PF00055  Laminin_Nterm
PF00059  Lectin_c
PF01463  LRRCT
PF01462  LRRNT
PF00057  Ldl_recept_a
PF00058  Ldl_recept_b
PF00530  SRCR
PF00084  Sushi
PF00090  Tsp_1
PF00092  Vwa
PF00093  Vwc
PF00094  Vwd



PF00244 

PF00023 

PF00514 

PF00168 

PF00027 

PF01556 

PF00226 

PF00036 

PF00611 

PF01846 

PF00498 



14-3-3
Ank
Armadillo_seg
C2
cNMP_binding
DnaJ_C
DnaJ
Efhand**
FCH
FF
FHA



RhoGAP domain 
RhoGEF domain 

SAM domain (Sterile alpha motif) 
Sec7 domain 

Src homology 2 (SH2) domain 
Src homology 3 (SH3) domain 
STAT protein 
VHS domain 
WH1 domain 

Domains involved in apoptosis 

Bcl-2

Bcl-2 homology region 4 

Caspase recruitment domain 

Death domain 

Death effector domain 

Domain present in Hsp70 regulators 

ICE-like protease (caspase) p20 domain 

Inhibitor of Apoptosis domain 

Cytoskeletal

Actin 
Annexin 
Calponin family 

FERM domain (Band 4.1 family) 
Nebulin repeat 
Plectin repeat 
Spectrin repeat 

Tau and MAP proteins, tubulin-binding 
Troponin 

Villin headpiece domain 
Vinculin family 

ECM adhesion
Collagen triple helix repeat (20 copies)
C-terminal tandem repeated domain in type 4 procollagen
CUB domain
EGF-like domain
Fibrinogen beta and gamma chains, C-terminal globular domain
Fibronectin type III domain 
Furin-like cysteine rich region 
Integrin alpha cytoplasmic region 
Integrins, beta chain 
Laminin B (Domain IV) 
Laminin EGF-like (Domains III and V)
Laminin G domain 
Laminin N-terminal (Domain VI) 
Lectin C-type domain 
Leucine rich repeat C-terminal domain 
Leucine rich repeat N-terminal domain 
Low-density lipoprotein receptor domain class A 
Low-density lipoprotein receptor repeat class B 
Scavenger receptor cysteine-rich domain 
Sushi domain (SCR repeat) 
Thrombospondin type 1 domain 
von Willebrand factor type A domain 
von Willebrand factor type C domain 
von Willebrand factor type D domain

Protein interaction domains 

14-3-3 proteins 
Ank repeat 

Armadillo/beta-catenin-like repeats 
C2 domain 

Cyclic nucleotide-binding domain 
DnaJ C terminal region 
DnaJ domain 
EF hand 

Fes/CIP4 homology domain 
FF domain 
FHA domain 



59 
46 
29(31) 
13 

. 87(95) 
143(182) 
7 
4 
7 

9 
3 
16 
16 
4(5) 
5(8) 
11 
8(14) 

61 (64) 
16(55) 
13(22) 
29 (30) 
4(148) 

2(H) 
31 (195) 

4(12) 
4 
5 
4 

65 (279) 
6(11)

myelin proteins result in severe demyelination, which is a pathological
condition in which the myelin is lost and the nerve conduction is
severely impaired (130). Humans have at least 10 genes belonging to four
different families involved in myelin production (five myelin P0, three
myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte
glycoprotein, or MOG), and possibly more-remotely related members of the
MOG family. Flies have only a single myelin proteolipid, and worms have
none at all.



Intercellular and intracellular signaling 
pathways in development and homeostasis. 
Many protein families that have expanded in 
humans relative to the invertebrates are in- 
volved in signaling processes, particularly in 
response to development and differentiation 



Table 18 (Continued)



Accession 
number 



Domain name 



Domain description 



H 



W 



PF00254 

PF01590 

PF01344 

PF00560 

PF00917 

PF00989 

PF00595 

PF00169 

PF01535 

PF00536 

PF01369 

PF00017 

PF00018 

PF01740 

PF00515 

PF00400 

PF00397 

PF00569 

PF01754 

PF01388 

PF01426 

PF00643 

PF00533 

PF00439 

PF00651 

PF00145 

PF00385 

PF00125

PF00134 

PF00270 

PF01529 

PF00646 

PF00250

PF00320

PF01585 

PF00010 

PF00850 

PF00046 

PF01833 

PF02373 

PF02375 

PF00013 

PF01352 

PF00104 

PF00412 
PF00917
PF00249 
PF02344 
PF01753 
PF00628 
PF00157 
PF02257 
PF00076 

PF02037 
PF00622 
PF01852 
PF00907 



FKBP
GAF
Kelch
LRR**
MATH
PAS
PDZ
PH
PPR**
SAM
Sec7
SH2
SH3
STAS
TPR**
WD40**
WW
ZZ
Zf-A20
ARID
BAH
Zf-B_box**
BRCT
Bromodomain
BTB
DNA_methylase
Chromo
Histone
Cyclin
DEAD
Zf-DHHC
F-box**
Fork_head
GATA
G-patch
HLH**
Hist_deacetyl
Homeobox
TIG
JmjC
JmjN
KH-domain
KRAB
Hormone_rec
LIM
MATH
Myb_DNA-binding
Myc-LZ
Zf-MYND
PHD
Pou
RFX_DNA_binding
Rrm
SAP
SPRY
START
T-box



FKBP-type peptidyl-prolyl cis-trans isomerases
GAF domain
Kelch motif
Leucine Rich Repeat
MATH domain
PAS domain
PDZ domain (Also known as DHR or GLGF)
PH domain
PPR repeat
SAM domain (Sterile alpha motif)
Sec7 domain
Src homology 2 (SH2) domain
Src homology 3 (SH3) domain
STAS domain
TPR domain
WD40 domain
WW domain
ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains

A20-like zinc finger
ARID DNA binding domain
BAH domain
B-box zinc finger
BRCA1 C Terminus (BRCT) domain
Bromodomain
BTB/POZ domain
C-5 cytosine-specific DNA methylase
'chromo' (CHRromatin Organization Modifier) domain
Core histone H2A/H2B/H3/H4
Cyclin
DEAD/DEAH box helicase
DHHC zinc finger domain
F-box domain
Fork head domain
GATA zinc finger
G-patch domain
Helix-loop-helix DNA-binding domain
Histone deacetylase family
Homeobox domain
IPT/TIG domain
JmjC domain
JmjN domain
KH domain
KRAB box
Ligand-binding domain of nuclear hormone receptor
LIM domain containing proteins
MATH domain
Myb-like DNA-binding domain
Myc leucine zipper domain
MYND finger
PHD-finger
Pou domain - N-terminal to homeobox domain
RFX DNA-binding domain
RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
SAP domain
SPRY domain
START domain
T-box



15(20) 
7(8) 
54(157) 
25 (30) 
11 

18(19) 
96(154) 
193 (212) 
5 

29(31) 
13 

87(95) 
143(182) 
5 

72 (131) 
136 (305) 
32 (53) 
10(11) 



7(8) 


7(13)
24 (29)


2(4)
10


12(48) 


13(41)
102(178)


24 (30)
15(16)


5
61 (74)
13 (18)


60 (87) 


46(66) 


2 


5 


72 (78) 


65(68) 


24 


23 


3(4)
474 (2485)
8


3 


6
5


9 


33 (39) 


44(48) 


1 


3 


55 (75) 


46 (61) 


23 (27) 


4 


1 


6 


2 


13 


39(101) 


28 (54) 


16(31) 


65 (124) 


98 (226) 


72 (153) 


56 (121) 


167 (344) 


24 (39) 


16(24) 


5(8) 


11(15) 


13 


10 


2 


10
2


? 


o 

V 


Q 
O 


11 


6 


4 


2 


7 


8(10) 


7(8) 


4(5) 


5 


21 (25) 


32 (3S) 


1 


2 


0 


0 


17(28) 


10(18) 


23(35) 


. 10(16) 


12(16) 


37(48) 


16(22) 


18(26) 


10(15) 


28 


97(98) 


62 (64) 


86(91) 


1 (2) 


30(31) 


3(4) 


1 


0 


0 


13(15) 


24(27) 


14(15) 


17(18) 


1(2) 


12 


75 (81) 


5 


71 (73) 


8 


48 


19 


10 


10 


11 


35 


63 (66) 


48(50) 


55 (57) 


50 (52) 


84(87) 


15 


20 


16 


7 


22 


16 


15 


309 (324) 


9 


165 (167) 


35 (36) 


20 (21) 


15 


4 


0 


11(17) 


5(6) 


8(10) 


9 


26 


18 


16 


13 


4 


14(15) 


60(61) 


44 


24 


4 


39 


12 


5(6) 


8(10) 


5 


10 


160(178) 


100(103) 


82 (84) 


6 


66 


29 (53) 


11(13) 


5(7) 


2 


1 


10 


4 


6 


4 


7 


7 


4 


2 


3 


7 


28(67) 


14(32) 


17(46) 


4(14) 


27(61) 


204(243) 


0 


0 


0 


0 


47 


17 


142(147) 


0 


0 


62 (129) 


33 (83) 


33 (79) 


4(7) 


10(16) 


11 


5 


88(161) 


1 


61 (74) 


32 (43) 


18 (24) 


17(24) 


15(20) 


243 (401) 


1 


0 


0 


0 


0 


14 


14 


9 


1 


7 


68(86) 


40(53) 


32 (44) 


14(15) 


96(105) 


15 


5 


4 


0 


0 


7 


2 


1 


1 


0 


224 (324) 


127(199) 


94 (145) 


43 (73) 


232 (369) 


15
5


5 


6(7) 


44(51) 


10(12) 


5(7) 


3 


6 


10
17(19)

Table 18 (Continued)



Accession
number   Domain name  Domain description                                          H          F         W        Y       A

PF02135  Zf-TAZ       TAZ finger                                                  2(3)       1(2)      6(7)     0       10(15)
PF01285  TEA          TEA domain                                                  4          1         1        1       0
PF02176  Zf-TRAF      TRAF-type zinc finger                                       6(9)       1(3)      1        0       2
PF00352  TBP          Transcription factor TFIID (or TATA-binding protein, TBP)   2(4)       4(8)      2(4)     1(2)    2(4)
PF00567  TUDOR        TUDOR domain                                                9(24)      9(19)     4(5)     0       2
PF00642  Zf-CCCH      Zinc finger C-x8-C-x5-C-x3-H type (and similar)             17(22)     6(8)      22(42)   3(5)    31(46)
PF00096  Zf-C2H2**    Zinc finger, C2H2 type                                      564(4500)  234(771)  68(155)  34(56)  21(24)
PF00097  Zf-C3HC4     Zinc finger, C3HC4 type (RING finger)                       135(137)   57        88(89)   18      298(304)
PF00098  Zf-CCHC      Zinc knuckle                                                9(17)      6(10)     17(33)   7(13)   68(91)



(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β
(TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet
derived growth factor (PDGF), and ephrins. These growth factors affect
tissue differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors
of these developmental ligands are also expanded in humans. For example,
our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in
the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the
wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in
the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The
Groucho family of transcriptional corepressors downstream in the wnt
pathway is even more markedly expanded, with 13 predicted members in
humans (2 in the fly, 1 in the worm).

Extracellular adhesion molecules involved in signaling are expanded in
the human genome (Tables 18 and 19). The interactions of several of
these adhesion domains with extracellular matrix proteoglycans play a
critical role in host defense, morphogenesis, and tissue repair (131).
Consistent with the well-defined role of heparan sulfate proteoglycans
in modulating these interactions (132), we observe an expansion of the
heparin sulfate sulfotransferases in the human genome relative to worm
and fly. These sulfotransferases modulate tissue differentiation (133).
A similar expansion in humans is noted in structural proteins that
constitute the actin-cytoskeletal architecture. Compared with the fly
and worm, we observe an explosive expansion of the nebulin (35 domains
per protein on average), aggrecan (12 domains per protein on average),
and plectin (5 domains per protein on average) repeats in humans. These
repeats are present in proteins involved in modulating the
actin-cytoskeleton with predominant expression in neuronal, muscle, and
vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed
several expanded protein families and domains involved in cytoplasmic
signal transduction (Table 18). In particular, signal transduction
pathways playing roles in developmental regulation and acquired immunity
were substantially enriched. There is a factor of 2 or greater expansion
in humans in the Ras superfamily GTPases and the GTPase activator and
GTP exchange factors associated with them. Although there are about the
same number of tyrosine kinases in the human and C. elegans genomes, in
humans there is an increase in the SH2, PTB, and ITAM domains involved
in phosphotyrosine signal transduction. Further, there is a twofold
expansion of phosphodiesterases in the human genome compared with either
the worm or fly genomes.
The downstream effectors of the intracellular signaling molecules
include the transcription factors that transduce developmental fates.
Significant expansions are noted in the ligand-binding nuclear hormone
receptor class of transcription factors compared with the fly genome,
although not to the extent observed in the worm (Tables 18 and 19).
Perhaps the most striking expansion in humans is in the C2H2 zinc finger
transcription factors. Pfam detects a total of 4500 C2H2 zinc finger
domains in 564 human proteins, compared with 771 in 234 fly proteins.
This means that there has been a dramatic expansion not only in the
number of C2H2 transcription factors, but also in the number of these
DNA-binding motifs per transcription factor (8 on average in humans, 3.3
on average in the fly, and 2.3 on average in the worm). Furthermore,
many of these transcription factors contain either the KRAB or SCAN
domains, which are not found in the fly or worm genomes. These domains
are involved in the oligomerization of transcription factors and
increase the combinatorial partnering of these factors. In general, most
of the transcription factor domains are shared between the three animal
genomes, but the reassortment of these domains results in
organism-specific transcription factor families. The domain combinations
found in the human, fly, and worm include the BTB with C2H2 in the fly
and humans, and homeodomains alone or in combination with Pou and LIM
domains in all of the animal genomes. In plants, however, a different
set of transcription factors is expanded, namely, the myb family, and a
unique set that includes VP1 and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription factors compared with
the multicellular eukaryotes, and its repertoire is limited to the
expansion of the yeast-specific C6 transcription factor family involved
in metabolic regulation.
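
The per-protein averages quoted for the C2H2 zinc finger can be checked directly from the Table 18 counts. The following is a minimal sketch in Python, assuming the table's "proteins (domains)" convention. The human and fly counts are those quoted in the text; the worm pair (68 proteins, 155 domains) is read from the Zf-C2H2 row of Table 18 rather than from the text, and the names `C2H2_COUNTS` and `domains_per_protein` are illustrative only.

```python
# Zf-C2H2 counts as (proteins, domains). Human and fly values are quoted in
# the text; the worm pair is an assumption read from the Zf-C2H2 row of Table 18.
C2H2_COUNTS = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),
}


def domains_per_protein(proteins: int, domains: int) -> float:
    """Average number of domains per domain-containing protein."""
    return domains / proteins


for organism, (proteins, domains) in C2H2_COUNTS.items():
    print(f"{organism}: {domains_per_protein(proteins, domains):.1f}")
```

The three ratios round to about 8, 3.3, and 2.3, in line with the averages stated for human, fly, and worm.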

While we have illustrated expansions in a subset of signal transduction
molecules in the human genome compared with the other eukaryotic
genomes, it should be noted that most of the protein domains are highly
conserved. An interesting observation is that worms and humans have
approximately the same number of both tyrosine kinases and
serine/threonine kinases (Table 19). It is important to note, however,
that these are merely counts of the catalytic domain; the proteins that
contain these domains also display a wide repertoire of interaction
domains with significant combinatorial diversity.

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the
coagulation pathway and by the interactions that occur between the
vascular endothelium and platelets. Consistent with known anatomical and
physiological differences between vertebrates and invertebrates,
extracellular adhesion domains that constitute proteins integral to
hemostasis are expanded in the human relative to the fly and worm
(Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1,
FN2, and C1q that mediate surface interactions between hematopoietic
cells and the vascular matrix. In addition, there has been extensive
recruitment of more-ancient animal-specific domains such as VWA, VWC,
VWD, kringle, and FN3 into multidomain proteins that are involved in
hemostatic regulation. Although we do not find a large expansion in the
total number of serine proteases, this enzymatic domain has been
specifically recruited into several of these multidomain proteins for
proteolytic regulation in the vascular compartment. These are
represented in plasma proteins that belong to the kinin and complement
pathways. There is a significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs
(matrix metalloproteases) (Table 19). Proteolysis of extracellular
matrix (ECM) proteins is critical for tissue development and for tissue
degradation in diseases such as cancer, arthritis, Alzheimer's disease,
and a variety of inflammatory conditions (135, 136). ADAMs are a family
of integral membrane proteins with a pivotal role in fibrinogenolysis
and modulating interactions between hematopoietic components and the
vascular matrix components. These proteins have been shown to cleave
matrix proteins, and even signaling molecules: ADAM-17 converts tumor
necrosis factor-α, and ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified 19 members of the matrix
metalloprotease family, and a total of 51 members of the ADAM and
ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway
components across eukarya is consistent with its central role in
developmental regulation and as a response to pathogens and stress
signals. The signal transduction pathways involved in programmed cell
death, or apoptosis, are mediated by interactions between
well-characterized domains that include extracellular domains, adaptor
(protein-protein interaction) domains, and those found in effector and
regulatory enzymes (137). We enumerated the protein counts of central
adaptor and effector enzyme domains that are found only in the apoptotic
pathways to provide an estimate of divergence across eukarya and
relative expansion in the human genome when compared with the fly and
worm (Table 18). Adaptor domains found in proteins restricted only to
apoptotic regulation such as the DED domains are vertebrate-specific,
whereas others like BIR, CARD, and Bcl2 are represented in the fly and
worm (although the number of Bcl2 family members in humans is
significantly expanded). Although plants and yeast lack the caspases,
caspase-like molecules, namely the para- and meta-caspases, have been
reported in these organisms (138). Compared with other animal genomes,
the human genome shows an expansion in the adaptor and effector
domain-containing proteins involved in apoptosis, as well as in the
proteases involved in the cascade such as the caspase and calpain
families.

Expansions of other protein families.
Metabolic enzymes. There are fewer cytochrome P450 genes in humans than
in either the fly or worm. Lipoxygenases (six in humans), on the other
hand, appear to be specific to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans) may be
vertebrate-specific. Lipoxygenases are involved in arachidonic acid
metabolism, and they and their activators have been implicated in
diverse human pathology ranging from allergic responses to cancers. One
of the most surprising human expansions, however, is in the number of
glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3
in the fly, and 4 in the worm). There is, however, evidence for many
retrotransposed GAPDH pseudogenes (139), which may account for this
apparent expansion. However, it is interesting that GAPDH, long known as
a conserved enzyme involved in basic metabolism found across all phyla
from bacteria to humans, has recently been shown to have other
functions. It has a second catalytic activity, as a uracil DNA
glycosylase (140), and functions as a cell cycle regulator (141) and has
even been implicated in apoptosis (142).

Table 19. Number of proteins assigned to selected Panther families or
subfamilies in H. sapiens (H), D. melanogaster (F), C. elegans (W),
S. cerevisiae (Y), and A. thaliana (A).

Translation. Another striking set of human expansions has occurred in
certain families involved in the translational machinery. We identified
28 different ribosomal subunits that each have at least 10 copies in the
genome; on average, for all ribosomal proteins there is about an 8- to
10-fold expansion in the number of genes relative to either the worm or
fly. Retrotransposed pseudogenes may account for many of these
expansions [see the discussion above and (143)]. Recent evidence
suggests that a number of ribosomal proteins have secondary functions
independent of their involvement in protein biosynthesis; for example,
L13a and the related L7 subunits (36 copies in humans) have been shown
to induce apoptosis (144).

Table 19 (Continued)

There is also a four- to fivefold expansion in the elongation factor
1-alpha family (eEF1A; 56 human genes). Many of these expansions likely
represent intronless paralogs that have presumably arisen from
retrotransposition, and again there is evidence that many of these may
be pseudogenes (145). However, a second form (eEF1A2) of this factor has
been identified with tissue-specific expression in skeletal muscle and a
complementary expression pattern to the ubiquitously expressed eEF1A
(146).

Ribonucleoproteins. Alternative splicing results in multiple transcripts
from a single gene, and can therefore generate additional diversity in
an organism's protein complement. We have identified 269 genes for
ribonucleoproteins. This represents over 2.5 times the number of
ribonucleoprotein genes in the worm, two times that of the fly, and
about the same as the 265 identified in the Arabidopsis genome. Whether
the diversity of ribonucleoprotein genes in humans contributes to gene
regulation at either the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most
prominent expansion is the transglutaminases, calcium-dependent enzymes
that catalyze the cross-linking of proteins in cellular processes such
as hemostasis and apoptosis (147). The vitamin K-dependent gamma
carboxylase gene product acts on the GLA domain (missing in the fly and
worm) found in coagulation factors, osteocalcin, and matrix GLA protein
(148). Tyrosylprotein sulfotransferases participate in the
posttranslational modification of proteins involved in inflammation and
hemostasis, including coagulation factors and chemokine receptors (149).
Although there is no significant numerical increase in the counts for
domains involved in nuclear protein modification, there are a number of
domain arrangements in the predicted human proteins that are not found
in the other currently sequenced genomes. These include the tandem
association of two histone deacetylase domains in HD6 with a ubiquitin
finger domain, a feature lacking in the fly genome. An additional
example is the co-occurrence of the important nuclear regulatory enzyme
PARP (poly-ADP ribosyl transferase) domain fused to protein-interaction
domains, BRCT and VWA, in humans.

Concluding remarks. There are several possible explanations for the
differences in phenotypic complexity observed in humans when compared to
the fly and worm. Some of these relate to the prominent differences in
the immune system, hemostasis, neuronal, vascular, and cytoskeletal
complexity. The finding that the human genome contains fewer genes than
previously predicted might be compensated for by combinatorial diversity
generated at the levels of protein architecture, transcriptional and
translational control, posttranslational modification of proteins, or
posttranscriptional regulation. Extensive domain shuffling to increase
or alter combinatorial diversity can provide an exponential increase in
the ability to mediate protein-protein interactions without dramatically
increasing the absolute size of the protein complement (150). Evolution
of apparently new (from the perspective of sequence analysis) protein
domains and increasing regulatory complexity by domain accretion both
quantitatively and qualitatively (recruitment of novel domains with
preexisting ones) are two features that we observe in humans. Perhaps
the best illustration of this trend is the C2H2 zinc finger-containing
transcription factors, where we see expansion in the number of domains
per protein, together with vertebrate-specific domains such as KRAB and
SCAN. Recent reports on the prominent use of internal ribosomal entry
sites in the human genome to regulate translation of specific classes of
proteins suggest that this is an area that needs further research to
identify the full extent of this process in the human genome (151). At
the posttranslational level, although we provide examples of expansions
of some protein families involved in these modifications, further
experimental evidence is required to evaluate whether this is correlated
with increased complexity in protein processing. Posttranscriptional
processing and the extent of isoform generation in the human remain to
be cataloged in their entirety. Given the conserved nature of the
spliceosomal machinery, further analysis will be required to dissect
regulation at this level.

8 Conclusions

8.1 The whole-genome sequencing approach versus BAC by BAC

Experience in applying the whole-genome shotgun sequencing approach to a
diverse group of organisms with a wide range of genome sizes and repeat
content allows us to assess its strengths and weaknesses. With the
success of the method for a large number of microbial genomes,
Drosophila, and now the human, there can be no doubt concerning the
utility of this method. The large number of microbial genomes that have
been sequenced by this method (15, 80, 152) demonstrates that
megabase-sized genomes can be sequenced efficiently without any input
other than the de novo mate-paired sequences. With more complex genomes
like those of Drosophila or human, map information, in the form of
well-ordered markers, has been critical for long-range ordering of
scaffolds. For joining scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is more important than the
number of markers per se. Although this mapping could have been
performed concurrently with sequencing, the prior existence of mapping
data was beneficial. During the sequencing of the A. thaliana genome,
sequencing of individual BAC clones permitted extension of the sequence
well into centromeric regions and allowed high-quality resolution of
complex repeat regions. Likewise, in Drosophila, the BAC physical map
was most useful in regions near the highly repetitive centromeres and
telomeres. WGA has been found to deliver excellent-quality
reconstructions of the unique regions of the genome. As the genome size,
and more importantly the repetitive content, increases, the WGA approach
delivers less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches make them
difficult to justify as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific applications of BAC-based or other
clone mapping and sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently resolved with computational
approaches alone are clearly worth exploring. Hybrid approaches to
whole-genome sequencing will only work if there is sufficient coverage
in both the whole-genome shotgun phase and the BAC clone sequencing
phase. Our experience with human genome assembly suggests that this will
require at least 3X coverage of both whole-genome and BAC shotgun
sequence data.
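
To give a rough sense of what a coverage figure such as 3X implies, the sketch below uses the idealized Lander-Waterman model. This model is an assumption introduced here for illustration and is not the assembly analysis described in this paper: if reads land uniformly at random, the expected fraction of bases covered at least once at coverage c is 1 - e^(-c). Repeats and cloning bias, which matter greatly in practice, are ignored, and the function name is illustrative.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read, assuming reads
    are placed uniformly at random (idealized Lander-Waterman model)."""
    return 1.0 - math.exp(-coverage)


for c in (1, 3, 5):
    print(f"{c}X: {expected_fraction_covered(c):.3f}")
```

Under these assumptions, 3X coverage leaves roughly 5% of bases unsampled in expectation, which is one way to see why each phase of a hybrid strategy needs a minimum depth of its own.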



8.2 The low gene number in humans 

We have sequenced and assembled ~95% of the euchromatic sequence of
H. sapiens and used a new automated gene prediction method to produce a
preliminary catalog of the human genes. This has provided a major
surprise: We have found far fewer genes (26,000 to 38,000) than the
earlier molecular predictions (50,000 to over 140,000). Whatever the
reasons for this current disparity, only detailed annotation,
comparative genomics (particularly using the Mus musculus genome), and
careful molecular dissection of complex phenotypes will clarify this
critical issue of the basic "parts list" of our genome. Certainly, the
analysis is still incomplete and considerable refinement will occur in
the years to come as the precise structure of each transcription unit is
evaluated. A good place to start is to determine why the gene estimates
derived from EST data are so discordant with our predictions. It is
likely that the following contribute to an inflated gene number derived
from ESTs: the variable lengths of 3'- and 5'-untranslated leaders and
trailers; the little-understood vagaries of RNA processing that often
leave intronic regions in an unspliced condition; the finding that
nearly 40% of human genes are alternatively spliced (153); and finally,
the unsolved technical problems in EST library construction where
contamination from heterogeneous nuclear RNA and genomic DNA are not
uncommon. Of course, it is possible that there are genes that remain
unpredicted owing to the absence of EST or protein data to support them,
although our use of mouse genome data for predicting genes should limit
this number. As was true at the beginning of genome sequencing,
ultimately it will be necessary to measure mRNA in specific cell types
to demonstrate the presence of a gene.

J. B. S. Haldane speculated in 1937 that a 
population of organisms might have to pay a 
price for the number of genes it can possibly 
carry. He theorized that when the number of 
genes becomes too large, each zygote carries 
so many new deleterious mutations that the 
population simply cannot maintain itself. On 
the basis of this premise, and on the basis of 
available mutation rates and x-ray-induced 
mutations at specific loci, Muller, in 1967 
(154), calculated that the mammalian ge- 
nome would contain a maximum of not much 
more than 30,000 genes (155). An estimate of 
30,000 gene loci for humans was also arrived 
at by Crow and Kimura (156). Muller's esti- 
mate for D. melanogaster was 10,000 genes, 
compared to 13,000 derived by annotation of 
the fly genome (26, 27). These arguments for 
the theoretical maximum gene number were 
based on simplified ideas of genetic load: 
that all genes have a certain low rate of 
mutation to a deleterious state. However, it is 
clear that many mouse, fly, worm, and yeast 
knockout mutations lead to almost no dis- 
cernible phenotypic perturbations. 

The modest number of human genes 
means that we must look elsewhere for the 
mechanisms that generate the complexities 
inherent in human development and the so- 
phisticated signaling systems that maintain 
homeostasis. There are a large number of 
ways in which the functions of individual 
genes and gene products are regulated. The 
degree of "openness" of chromatin structure 
and hence transcriptional activity is regulated 
by protein complexes that involve histone 
and DNA enzymatic modifications. We enu- 
merate many of the proteins that are likely 
involved in nuclear regulation in Table 19. 
The location, timing, and quantity of tran- 
scription are intimately linked to nuclear sig- 
nal transduction events as well as to the 
tissue-specific expression of many of these 
proteins. Equally important are regulatory 
DNA elements that include insulators, re- 
peats, and endogenous viruses (157); meth- 
ylation of CpG islands in imprinting (158); 
and promoter-enhancer and intronic regions 
that modulate transcription. The spliceosomal 
machinery consists of multisubunit proteins 
(Table 19) as well as structural and catalytic 
RNA elements (159) that regulate transcript 
structure through alternative start and termi- 
nation sites and splicing. Hence, there is a 
need to study different classes of RNA mol- 
ecules (160) such as small nucleolar RNAs, 
antisense riboregulator RNA, RNA involved 
in X-dosage compensation, and other struc- 
tural RNAs to appreciate their precise role in 
regulating gene expression. The phenomenon 



of RNA editing in which coding changes 
occur directly at the level of mRNA is of 
clinical and biological relevance (161). Final- 
ly, examples of translational control include 
internal ribosomal entry sites that are found 
in proteins involved in cell cycle regulation 
and apoptosis (162). At the protein level, 
minor alterations in the nature of protein- 
protein interactions, protein modifications, 
and localization can have dramatic effects on 
cellular physiology (163). This dynamic sys- 
tem therefore has many ways to modulate 
activity, which suggests that definition of 
complex systems by analysis of single genes 
is unlikely to be entirely successful. 

In situ studies have shown that the human 
genome is asymmetrically populated with 
G+C content, CpG islands, and genes (68). 
However, the genes are not distributed quite 
as unequally as had been predicted (Table 9) 
(69). The most G+C-rich fraction of the ge- 
nome, H3 isochores, constitute more of the 
genome than previously thought (about 9%), 
and are the most gene-dense fraction, but 
contain only 25% of the genes, rather than the 
predicted ~40%. The low G+C L isochores 
make up 65% of the genome, and 48% of the 
genes. This inhomogeneity, the net result of 
millions of years of mammalian gene dupli- 
cation, has been described as the "desertifi- 
cation" of the vertebrate genome (77). Why 
are there clustered regions of high and low 
gene density, and are these accidents of his- 
tory or driven by selection and evolution? If 
these deserts are dispensable, it ought to be 
possible to find mammalian genomes that are 
far smaller in size than the human genome. 
Indeed, many species of bats have genome 
sizes that are much smaller than that of hu- 
mans; for example, Miniopterus, a species of 
Italian bat, has a genome size that is only 
50% that of humans (164). Similarly, Mun- 
tiacus, a species of Asian barking deer, has a 
genome size that is ~70% that of humans. 
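
The isochore percentages quoted above can be restated as gene densities relative to the genome-wide average. The following is a minimal arithmetic sketch, assuming only the figures given in the text (H3: about 9% of the genome and 25% of the genes; L: 65% and 48%); the function name is hypothetical and not part of the original analysis.

```python
def relative_gene_density(genome_fraction, gene_fraction):
    """Gene density of a compartment relative to the genome-wide average."""
    return gene_fraction / genome_fraction

# Percentages quoted in the text, expressed as fractions.
h3 = relative_gene_density(genome_fraction=0.09, gene_fraction=0.25)
low_gc = relative_gene_density(genome_fraction=0.65, gene_fraction=0.48)

print(f"H3 isochores: {h3:.2f}x the average gene density")
print(f"L isochores:  {low_gc:.2f}x the average gene density")
print(f"H3 vs. L:     {h3 / low_gc:.1f}-fold")
```

On these figures the H3 fraction is a little under three times as gene-dense as the average, and roughly four times as dense as the L isochores.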



8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a 
nearly uniform ascertainment of polymorphism 
has been completed. Although we have identi- 
fied and mapped more than 3 million SNPs, this 
by no means implies that the task of finding and 
cataloging SNPs is complete. These represent 
only a fraction of the SNPs present in the 
human population as a whole. Nevertheless, 
this first glimpse at genome-wide variation has 
revealed strong inhomogeneities in the distribu- 
tion of SNPs across the genome. Polymorphism 
in DNA carries with it a snapshot of the past 
operation of population genetic forces, includ- 
ing mutation, migration, selection, and genetic 
drift. The availability of a dense array of SNPs 
will allow questions related to each of these 
factors to be addressed on a genome-wide basis. 
SNP studies can establish the range of haplo- 






16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag. 






types present in subjects of different ethnogeo- 
graphic origins, providing insights into popula- 
tion history and migration patterns. Although 
such studies have suggested that modern human 
lineages derive from Africa, many important 
questions regarding human origins remain un- 
answered, and more analyses using detailed 
SNP maps will be needed to settle these con- 
troversies. In addition to providing evidence for 
population expansions, migration, and admix- 
ture, SNPs can serve as markers for the extent 
of evolutionary constraint acting on particular 
genes. The correlation between patterns of in- 
traspecies and interspecies genetic variation 
may prove to be especially informative to iden- 
tify sites of reduced genetic diversity that may 
mark loci where sequence variations are not 
tolerated. 

The remarkable heterogeneity in SNP 
density implies that there are a variety of 
forces acting on polymorphism — sparse re- 
gions may have lower SNP density because 
the mutation rate is lower, because most of 
those regions have a lower fraction of muta- 
tions that are tolerated, or because recent 
strong selection in favor of a newly arisen 
allele "swept" the linked variation out of the 
population (165). The effect of random ge- 
netic drift also varies widely across the ge- 
nome. The nonrecombining portion of the Y 
chromosome faces the strongest pressure 
from random drift because there are roughly 
one-quarter as many Y chromosomes in the 
population as there are autosomal chromo- 
somes, and the level of polymorphism on the 
Y is correspondingly less. Similarly, the X 
chromosome has a smaller effective popu- 
lation size than the autosomes, and its nu- 
cleotide diversity is also reduced. But even 
across a single autosome, the effective pop- 
ulation size can vary because the density of 
deleterious mutations may vary. Regions of 
high density of deleterious mutations will 
see a greater rate of elimination by selec- 
tion, and the effective population size will 
be smaller (166). As a result, the density of 
even completely neutral SNPs will be lower 
in such regions. There is a large literature 
on the association between SNP density 
and local recombination rates in Drosoph- 
ila, and it remains an important task to 
assess the strength of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. 
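
The "one-quarter" figure for the Y chromosome, and the smaller effective population size of the X, follow from counting chromosome copies. The sketch below is illustrative only: it assumes an idealized population with equal numbers of breeding males and females, and the function name and population sizes are hypothetical.

```python
from fractions import Fraction

def relative_copy_number(n_males, n_females):
    """Copies of the X and Y relative to any one autosome in a population."""
    autosome = 2 * (n_males + n_females)   # two copies per individual
    x_chrom = n_males + 2 * n_females      # one per male, two per female
    y_chrom = n_males                      # one per male
    return Fraction(x_chrom, autosome), Fraction(y_chrom, autosome)

x_ratio, y_ratio = relative_copy_number(n_males=5000, n_females=5000)
print(x_ratio, y_ratio)  # 3/4 1/4
```

With fewer copies in the population, random drift removes variation faster, which is consistent with the reduced polymorphism described for the X and Y.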



8.4 Genome complexity 

We will soon be in a position to move away 
from the cataloging of individual compo- 
nents of the system, and beyond the sim- 
plistic notions of "this binds to that, which 



then docks on this, and then the complex 
moves there" (167) to the exciting area 
of network perturbations, nonlinear re- 
sponses and thresholds, and their pivotal 
role in human diseases. 

The enumeration of other "parts lists" re- 
veals that in organisms with complex nervous 
systems, neither gene number, neuron number, 
nor number of cell types correlates in any 
meaningful manner with even simplistic mea- 
sures of structural or behavioral complexity. 
Nor would they be expected to; this is the realm 
of nonlinearities and epigenesis (168). The 520 
million neurons of the common octopus exceed 
the neuronal number in the brain of a mouse by 
an order of magnitude. It is apparent from a 
comparison of genomic data on the mouse and 
human, and from comparative mammalian neu- 
roanatomy (169), that the morphological and 
behavioral diversity found in mammals is un- 
derpinned by a similar gene repertoire and sim- 
ilar neuroanatomies. For example, when one 
compares a pygmy marmoset (which is only 4 
inches tall and weighs about 6 ounces) to a 
chimpanzee, the brain volume of this minute 
primate is found to be only about 1.5 cm³, two 
orders of magnitude less than that of a chimp 
and three orders less than that of humans. Yet 
the neuroanatomies of all three brains are strik- 
ingly similar, and the behavioral characteristics 
of the pygmy marmoset are little different from 
those of chimpanzees. Between humans and 
chimpanzees, the gene number, gene structures 
and functions, chromosomal and genomic or- 
ganizations, and cell types and neuroanatomies 
are almost indistinguishable, yet the develop- 
mental modifications that predisposed human 
lineages to cortical expansion and development 
of the larynx, giving rise to language, culminat- 
ed in a massive singularity that by even the 
simplest of criteria made humans more com- 
plex in a behavioral sense. 

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does not alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 
that result in such great variation. In addition, 
it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that are significantly increased in 
the human genome compared with the fly and 
worm. These include extracellular ligands 
and their cognate receptors (e.g., wnt, friz- 
zled, TGF-β, ephrins, and connexins), as well 
as nuclear regulators (e.g., the KRAB and 
homeodomain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



8.5 Beyond single components 

While few would disagree with the intuitive 
conclusion that Einstein's brain was more 
complex than that of Drosophila, closer com- 
parisons such as whether the set of predicted 
human proteins is more complex than the 
protein set of Drosophila, and if so, to what 
degree, are not straightforward, since protein, 
protein domain, or protein-protein interaction 
measures do not capture context-dependent 
interactions that underpin the dynamics un- 
derlying phenotype. 

Currently, there are more than 30 different 
mathematical descriptions of complexity (170). 
However, we have yet to understand the math- 
ematical dependency relating the number of 
genes with organism complexity. One pragmat- 
ic approach to the analysis of biological sys- 
tems, which are composed of nonidentical ele- 
ments (proteins, protein complexes, interacting 
cell types, and interacting neuronal popula- 
tions), is through graph theory (171). The ele- 
ments of the system can be represented by the 
vertices of complex topographies, with the edg- 
es representing the interactions between them. 
Examination of large networks reveals that they 
can self-organize, but more important, they can 
be particularly robust. This robustness is not 
due to redundancy, but is a property of inho- 
mogeneously wired networks. The error toler- 
ance of such networks comes with a price; they 
are vulnerable to the selection or removal of a 
few nodes that contribute disproportionately to 
network stability. Gene knockouts provide an 
illustration. Some knockouts may have minor 
effects, whereas others have catastrophic effects 
on the system. In the case of vimentin, a sup- 
posedly critical component of the cytoplasmic 
intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious pheno- 
typic effects (172), and yet the usually conspic- 
uous vimentin network is completely absent. 
On the other hand, ~30% of knockouts in 
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenotypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets that spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity," particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
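
The graph-theory picture above (elements as vertices, interactions as edges, robustness to most knockouts but vulnerability at a few highly connected nodes) can be illustrated with a toy network. This is a minimal sketch under stated assumptions, not a model of any real gene network: the hub-and-spoke layout, node names, and sizes are all hypothetical.

```python
from collections import deque

def build_hub_network(n_subhubs=4, leaves_per_subhub=5):
    """Toy inhomogeneous network: one central hub, a few sub-hubs, many leaves."""
    adj = {"hub": set()}
    for i in range(n_subhubs):
        sub = f"sub{i}"
        adj[sub] = {"hub"}
        adj["hub"].add(sub)
        for j in range(leaves_per_subhub):
            leaf = f"leaf{i}_{j}"
            adj[leaf] = {sub}
            adj[sub].add(leaf)
    return adj

def largest_component(adj, removed=()):
    """Size of the largest connected component after deleting `removed` nodes."""
    removed = set(removed)
    seen, best = set(), 0
    for start in adj:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in removed and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best

network = build_hub_network()
print(largest_component(network))                        # 25: intact network
print(largest_component(network, removed=["leaf0_0"]))   # 24: peripheral "knockout"
print(largest_component(network, removed=["hub"]))       # 6: hub "knockout"
```

Removing a peripheral node barely changes the connected structure, whereas removing the single hub fragments the toy network, which mirrors the contrast drawn between knockouts with minor effects and those with catastrophic effects.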

It has been predicted for the last 15 years 
that complete sequencing of the human ge- 
nome would open up new strategies for hu- 
man biological research and would have a 
major impact on medicine, and through med- 
icine and public health, on society. Effects on 
biomedical research are already being felt. 
This assembly of the human genome se- 
quence is but a first, hesitant step on a long 
and exciting journey toward understanding 
the role of the genome in human biology. It 
has been possible only because of innova- 
tions in instrumentation and software that 
have allowed automation of almost every step 
of the process from DNA preparation to an- 
notation. The next steps are clear: We must 
define the complexity that ensues when this 
relatively modest set of about 30,000 genes is 
expressed. The sequence provides the frame- 
work upon which all the genetics, biochem- 
istry, physiology, and ultimately phenotype 
depend. It provides the boundaries for scien- 
tific inquiry. The sequence is only the first 
level of understanding of the genome. All 
genes and their control elements must be 
identified; their functions, in concert as well 
as in isolation, defined; their sequence varia- 
tion worldwide described; and the relation 
between genome variation and specific phe- 
notypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc- 
tionism, the view that with complete knowl- 
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 
THE HUMAN GENOME 



A historic 
moment for 
the scientific 
endeavor. 



Humanity has been given a great gift. With the completion of the human 
genome sequence, we have received a powerful tool for unlocking the 
secrets of our genetic heritage and for finding our place among the other 
participants in the adventure of life. 

This week's issue of Science contains the report of the sequencing of 
the human genome from a group of authors led by Craig Venter of Celera 
Genomics. The report of the sequencing of the human genome from the 
publicly funded consortium of laboratories led by Francis Collins appears 
in this week's Nature. This stunning achievement has been portrayed, 
often unfairly, as a competition between two 
ventures, one public and one private. That characterization detracts from 
the awesome accomplishment jointly unveiled this week. In truth, each 
project contributed to the other. The inspired vision that launched the 
publicly funded project roughly 10 years ago reflected, and now rewards, 
the confidence of those who believe that the pursuit of large-scale funda- 
mental problems in the life sciences is in the national interest. The technical 
innovation and drive of Craig Venter and his colleagues made it possible 
to celebrate this accomplishment far sooner than was believed possible. 
Thus, we can salute what has become, in the end, not a contest but a 
marriage (perhaps encouraged by shotgun) between public funding and 
private entrepreneurship. 

There are excellent scientific reasons for applauding an outcome that 
has given us two winners. Two sequences are better than one; the opportunity for comparison and con- 
vergence is invaluable. Indeed, a real-world proof of the importance of access to both sets of data can 
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298). 

Although we have made the point before, it is worth repeating that the sequencing of the human 
genome represents, not an ending, but the beginning of a new approach to biology. As Galas says in 
his Viewpoint (p. 1257), the knowledge that all of the genetic components of any process can be 
identified will give extraordinary new power to scientists. Because of this breakthrough, research 
can evolve from analyzing the effects of individual genes to a more integrated view that examines 
whole ensembles of genes as they interact to form a living human being. Several articles in this issue 
highlight how this approach is already beginning to revolutionize the way we look at human disease. 

This has been a massive project, on a scale unparalleled in the history of biology, but of course 
it has built on the scientific insights of centuries of investigators. By coincidence, this landmark 
announcement falls during the week of the anniversary of the birth of Charles Darwin. Darwin's 
message that the survival of a species can depend on its ability to evolve in the face of change is 
peculiarly pertinent to discussions that have gone on in the past year over access to the Celera data. 
(Full information regarding the agreements that were reached to make the data available can be 
found at www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to be flexible in 
allowing data repositories other than the traditional GenBank, while insisting on access to all the 
data needed to verify conclusions. In this domain, change is everywhere: Commercial researchers 
are producing more and more potentially valuable sequences, yet (at least in the United States) 
laws governing databases provide scant protection against piracy. Had the Celera data been kept se- 
cret, it would have been a serious loss to the scientific community. We hope that our adaptability in 
the face of change will enable other proprietary data to be published after peer review, in a way that 
satisfies our continuing commitment to full access. 

It should be no surprise that an achievement so stunning, and so carefully watched, has created 
new challenges for the scientific venture. Science is proud to have played a role in bringing this 
discovery onto the public stage. It is literally true that this is a historic moment for the scientific en- 
deavor. The human genome has been called the Book of Life. Rather, it is a library, in which, with 
rules that encourage exploration and reward creativity, we can find many of the books that will 
help define us and our place in the great tapestry of life. 

Barbara R. Jasny and Donald Kennedy 



www.sciencemag.org SCIENCE VOL291 16 FEBRUARY 2001 



1153 




Serial # 09/775,308 Exhibit L 

Novel Human 7TM Proteins and Polynucleotides Encoding the Same 




Query= SEQ ID NO: 8 

(924 letters) 



                                                    Score     E 
Sequences producing significant alignments:        (bits)   Value 

AC091612.4.1.180657                                  1824    0.0 



>AC091612.4.1.180657 

Length = 180657 

Score = 1824 bits (920), Expect = 0.0 
Identities = 923/924 (99%) 
Strand = Plus / Minus 






Query: 1      atgaatcacagcgttgtaactgagttcattattctgggcctcaccaaaaagcctgaactc 60 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155813 atgaatcacagcgttgtaactgagttcattattctgggcctcaccaaaaagcctgaactc 155754 

Query: 61     cagggaattatcttcctcttttttctcattgtctatcttgtggcttttctcggcaacatg 120 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155753 cagggaattatcttcctcttttttctcattgtctatcttgtggcttttctcggcaacatg 155694 

Query: 121    ctcatcatcattgccaaaatctatagcaacaccttgcatacgcccatgtatgttttcctt 180 
              ||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||| 
Sbjct: 155693 ctcatcatcattgccaaaatctataacaacaccttgcatacgcccatgtatgttttcctt 155634 

Query: 181    ctgacactggctgttgtggacatcatctgcacaacaagcatcataccgaagatgctgggg 240 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155633 ctgacactggctgttgtggacatcatctgcacaacaagcatcataccgaagatgctgggg 155574 

Query: 241    accatgctaacatcagaaaataccatttcatatgcaggctgcatgtcccagctcttcttg 300 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155573 accatgctaacatcagaaaataccatttcatatgcaggctgcatgtcccagctcttcttg 155514 

Query: 301    ttcacatggtctctgggagctgagatggttctcttcaccaccatggcctatgaccgctat 360 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155513 ttcacatggtctctgggagctgagatggttctcttcaccaccatggcctatgaccgctat 155454 

Query: 361    gtggccatttgtttccctcttcattacagtactattatgaaccaccatatgtgtgtagcc 420 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155453 gtggccatttgtttccctcttcattacagtactattatgaaccaccatatgtgtgtagcc 155394 

Query: 421    ttgctcagcatggtcatggctattgcagtcaccaattcctgggtgcacacagctcttatc 480 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155393 ttgctcagcatggtcatggctattgcagtcaccaattcctgggtgcacacagctcttatc 155334 

Query: 481    atgaggttgactttctgtgggccaaacaccattgaccacttcttctgtgagataccccca 540 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155333 atgaggttgactttctgtgggccaaacaccattgaccacttcttctgtgagataccccca 155274 

Query: 541    ttgctggctttgtcctgtagccctgtaagaatcaatgaggtgatggtgtatgttgctgat 600 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155273 ttgctggctttgtcctgtagccctgtaagaatcaatgaggtgatggtgtatgttgctgat 155214 

Query: 601    attaccctggccataggggactttattcttacctgcatctcctatggttttatcattgtt 660 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155213 attaccctggccataggggactttattcttacctgcatctcctatggttttatcattgtt 155154 

Query: 661    gctattctccgtatccgcacagtagaaggcaagaggaaggccttctcaacatgctcatct 720 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155153 gctattctccgtatccgcacagtagaaggcaagaggaaggccttctcaacatgctcatct 155094 

Query: 721    catctcacagtggtgaccctttactattctcctgtaatctacacctatatccgccctgct 780 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155093 catctcacagtggtgaccctttactattctcctgtaatctacacctatatccgccctgct 155034 

Query: 781    tccagctatacatttgaaagagacaaggtggtagctgcactctatactcttgtgactccc 840 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 155033 tccagctatacatttgaaagagacaaggtggtagctgcactctatactcttgtgactccc 154974 

Query: 841    acattaaacccgatggtgtacagcttccagaatagggagatgcaggcaggaattaggaag 900 
              |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 
Sbjct: 154973 acattaaacccgatggtgtacagcttccagaatagggagatgcaggcaggaattaggaag 154914 

Query: 901    gtgtttgcatttctgaaacactag 924 
              |||||||||||||||||||||||| 
Sbjct: 154913 gtgtttgcatttctgaaacactag 154890 
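
The figures in the alignment above can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the exhibit: it uses the identity count and coordinates printed in the report, plus the query and subject rows for query positions 121-180, and makes no claim about how the alignment program computes or rounds its own statistics.

```python
# Values copied from the alignment header in this exhibit.
identities, length = 923, 924
print(f"{100 * identities / length:.2f}% identity")  # 99.89% identity

# Plus/Minus strand: subject coordinates decrease as query coordinates increase.
query_start, query_end = 1, 924
sbjct_start = 155813
sbjct_end = sbjct_start - (query_end - query_start)
print(sbjct_end)  # 154890

# Query and subject rows for query positions 121-180.
query = "ctcatcatcattgccaaaatctatagcaacaccttgcatacgcccatgtatgttttcctt"
sbjct = "ctcatcatcattgccaaaatctataacaacaccttgcatacgcccatgtatgttttcctt"
mismatches = [(121 + i, q, s) for i, (q, s) in enumerate(zip(query, sbjct)) if q != s]
print(mismatches)  # [(146, 'g', 'a')]
```

The single g/a difference at query position 146 accounts for the one non-identical base in the 923/924 count.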
Details 



1: AC091612. Homo sapiens chro...[gi:18497169] 

LOCUS       AC091612     180657 bp    DNA     linear   HTG 05-FEB-2002 
DEFINITION  Homo sapiens chromosome 1 clone RP11-656022, WORKING DRAFT 
            SEQUENCE, 1 unordered piece. 
ACCESSION   AC091612 AL390860 
VERSION     AC091612.4  GI:18497169 
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT; HTGS_FULLTOP. 
SOURCE      Homo sapiens (human) 
  ORGANISM  Homo sapiens 
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; 
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 180657) 
  AUTHORS   Kaul,R.K., Olson,M.V., Raymond,C. and Haugen,E.D. 
  TITLE     Direct Submission 
  JOURNAL   Unpublished 
REFERENCE   2  (bases 1 to 180657) 
  AUTHORS   Kaul,R.K., Olson,M.V., Raymond,C., Clendenning,J., Ivey,R.G. and 
            Haugen,E.D. 
  TITLE     Direct Submission 
  JOURNAL   Submitted (09-MAY-2001) Genome Center, University of Washington, 
            Box 352145, Seattle, WA 98195, USA 
COMMENT     On Feb 5, 2002 this sequence version replaced gi:15487406. 
            -------------- Genome Center 
            Center: University of Washington Genome Center 
            Center Code: UWGC 
            Web site: http://www.genome.washington.edu 
            Contact: uwgchtgs@u.washington.edu 
            Drafting Center: SC 
            -------------- Project Information 
            Center project name: chr-1 
            Center clone name: RP11-656022 (sc0182) 
            -------------- Summary Statistics 
            Sequencing vector: plasmid; L08752; 100% of reads 
            Chemistry: Dye-terminator Big Dye; 100% of reads 
            Assembly program: Phrap; version 0.990319 
            Consensus quality: 180536 bases at least Q40 
            Consensus quality: 180650 bases at least Q30 
            Consensus quality: 180657 bases at least Q20 
            Insert size: 194815; 11.0% error; agarose-fp 
            Insert size: 180657; sum-of-contigs 
            Quality coverage: 8.4x in Q20 bases; agarose-fp 
            Quality coverage: 9.0x in Q20 bases; sum-of-contigs 
            -------------- 
            * NOTE: This is a 'working draft' sequence. It currently 
            * consists of 1 contigs. The true order of the pieces 
            * is not known and their order in this sequence record is 
            * arbitrary. Gaps between the contigs are represented as 
            * runs of N, but the exact sizes of the gaps are unknown. 
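
The "Consensus quality" lines in the record can be read as error bounds, on the assumption that the Q values are Phred-scaled scores, for which Q = -10 log10(p) with p the per-base error probability. The sketch below is illustrative only; the function name is hypothetical, and the base counts are copied from the record above.

```python
def phred_error_probability(q):
    """Per-base error probability implied by a Phred-scaled quality score Q."""
    return 10 ** (-q / 10)

total_bases = 180657
bases_at_least = {40: 180536, 30: 180650, 20: 180657}  # from the record

for q, count in bases_at_least.items():
    p = phred_error_probability(q)
    print(f"Q{q}: {count / total_bases:.2%} of bases, error <= 1 in {round(1 / p):,}")
```

On that reading, nearly all of the 180,657 consensus bases carry an estimated error rate of no more than 1 in 10,000.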
ABSTRACT 



The present invention provides polynucleotides (kin) which 
identify and encode novel protein kinases (KIN) expressed 
in various human cells and tissues. The present invention 
also provides for antisense sequences and oligonucleotides 
designed from the nucleotide sequences or their comple- 
ments. The invention further provides genetically engi- 
neered expression vectors and host cells for the production 
of purified KIN peptides, antibodies capable of binding KIN, 
and inhibitors which specifically bind KIN. The invention specifi- 
cally provides for diagnostic kits and assays which identify 
a disorder or disease with altered kinase expression and 
allow monitoring of patients during drug therapy. These 
assays utilize oligonucleotides or antibodies produced using 
the kin polynucleotides. 

4 Claims, No Drawings 
HUMAN KINASE HOMOLOGS 

FIELD OF THE INVENTION 

The present invention is in the field of molecular biology; 
more particularly, the present invention describes nucleic 
acid sequences for novel human kinase homologs. 

BACKGROUND OF THE INVENTION 

Kinases regulate many different cell proliferation, 
differentiation, and signalling processes by adding phos- 
phate groups to proteins. Uncontrolled signalling has been 
implicated in inflammation, oncogenesis, arteriosclerosis, 
and psoriasis. Reversible protein phosphorylation is the 
main strategy for controlling activities of eukaryotic cells. It 
is estimated that more than 1000 of the 10,000 proteins 
active in a typical mammalian cell are phosphorylated. The 
high energy phosphate which drives activation is generally 
transferred from adenosine triphosphate molecules (ATP) to 
a particular protein by protein kinases and removed from 
that protein by protein phosphatases. 

Phosphorylation occurs in response to extracellular sig- 
nals (hormones, neurotransmitters, growth and differentia- 
tion factors, etc), cell cycle checkpoints, and environmental 
or nutritional stresses and is roughly analogous to the 
turning on a molecular switch. When the switch goes on, the 
appropriate protein kinase activates a metabolic enzyme, 
regulatory protein, receptor, cytoskeletal protein, ion chan- 
nel or pump, or transcription factor. 

The kinases comprise the largest known protein family, a 
superfamily of enzymes with widely varied functions and 
specificities. They are usually named after their substrate, 
their regulatory molecules, after some aspect of a mutant 
phenotype or arbitrarily. Almost all kinases contain a similar 
250-300 amino acid catalytic domain. The N-terminal 
domain, which contains subdomains I-IV, generally folds 
into a two-lobed structure and binds and orients the ATP (or 
GTP) donor molecule. The larger C terminal lobe, which 
contains subdomains VIA-XI, binds the protein substrate 
and carries out the transfer of the gamma phosphate from 
ATP to the hydroxyl group of a serine, threonine, or tyrosine 
residue. Subdomain V spans the two lobes. 

The kinases may be categorized into families by the 
different amino acid sequences (generally between 5 and 
100 residues) located on either side of, or inserted into loops 
of, the kinase domain. These added amino acid sequences 
allow the regulation of each kinase as it recognizes and 
interacts with its target protein. The primary structure of the 
kinase domains is conserved and can be further subdivided 
into 12 subdomains. The following residues are relatively 
(~95%) invariant: G50 and G52 in subdomain I, K72 in 

AMP (cAMP) cyclic GMP, inositol triphosphate, 
phosphatidylinositol, 3,4,5-triphosphate, cyclic ADPribose, 
arachidonic acid and diacylglycerol. For purposes of 
example, the structure and function of cyclic AMP- 
dependent protein kinase (A-kinase) will be described. 
Mammalian cells generally contain at least two forms of 
A-kinase; type 1 which is cytosolic, and type 2 which is 
bound to plasma membrane, nuclear membrane or microtu- 
bules. In its inactive state, A-kinase consists of a complex of 
two catalytic subunits and two regulatory subunits. When 
each regulatory subunit has bound two molecules of cAMP, 
the catalytic subunit is activated and can transfer a high 
energy phosphate from ATP to the serine or threonine of a 
substrate protein. Substrate proteins are usually marked by 
the presence of two or more basic amino acids on their 
amino terminal sides. A-kinase is important in metabolism 
of glycogen, for inactivation of phosphatase inhibitor 
protein, in transcription of genes which contain a regulatory 
region called the cAMP response element (CRE), and in 
regulation of the ion channels of olfactory neurons. 

Protein kinase C (PKC) is a water-soluble, Ca++-dependent 
kinase, commonly found in brain tissue, which 
moves to the plasma membrane in the presence of Ca++ ions. 
Approximately half of the known isoforms of PKC are 
activated initially by diacylglycerol and phosphatidylserine. 
Prolonged activation of PKC depends on continued produc- 
tion of diacylglycerol molecules which are formed when 
phospholipases cleave phosphatidylcholine. In nerve cells, 
PKC phosphorylates ion channels and alters the excitability 
of the cell membrane. In other cells, activation of PKC 
increases gene transcription either by triggering a protein 
kinase cascade which activates a regulatory element (much 
like CRE above) or by phosphorylating and deactivating an 
inhibitor of the regulatory protein. 

Ca++/calmodulin-dependent protein kinases (CaM- 
kinases) mediate most of the actions of Ca++ in human cells. 
The CaM-kinases include enzymes with narrow substrate 
specificity such as myosin light chain kinase which activates 
smooth muscle contraction and phosphorylase kinase which 
activates glycogen breakdown and the multifunctional 
enzyme, CaM-kinase II which is found in all cells. Phos- 
phorylase kinase has four subunits: γ is the catalytic moiety 
and α, β and δ are regulatory. Since subunits α and β are 
phosphorylated by A-kinase and subunit δ is Ca++/ 
calmodulin, glycogen breakdown can be activated by either 
cAMP or Ca++. 

CaM-kinase II is particularly enriched in catecholamine 
synapses. In those neurons, Ca++ influx stimulates both the 
release of dopamine, noradrenaline or adrenaline and also 

subdomain II, G 91 in subdomain III, E^ in subdomain VIII, their resynthesis through the activation of CaM-kinase II. 

D 2M and G«< in subdomain IX, and the motifs or patterns Although the main role of CaM-kinase II is phosphorylation 

of amino acids in subdomains VIB, VIII and IX (Hardie G. of tyrosine hydroxylase, the rate-limiting enzyme of cat- 

and Hanks S. (1995) The Protein Kinase Facts Books, I and 55 echolamine synthesis, CaM-kinase II also autophosphory- 

II, Academic Press, San Diego, Calif.). lates and remains active until phosphotases overwhelm it. 

The cyclin dependent protein kinase (cdk) family includes Transmembrane protein-tyrosine kinases are receptors for 

proteins which are turned on and off as the cell proceeds most growth factors. The first characterized receptor for 

through the cell cycle. A cdk is active as a kinase only when epidermal growth factor (EGF) is a single pass transmem- 

it is bound to a cyclin. Cdk activation simultaneously 60 brane protein of about 1200 ammo acids with an extracel- 

requires both the addition of a high energy phosphate to a hilar glycosylated portion that interacts with the 53 amino 

threonine residue by a kinase and the removal of a acid EGF molecule. Binding activates the transfer of a 

covalently-bound phosphate from a specific tyrosine residue phosphate group from ATP to selected tyrosine side chains 

by a phosphatase. The concentration of some cyclins rises of the receptor and other specific proteins. Other protein 

gradually through a particular part of the cell cycle until their 65 receptors with similar structure include the following growth 

targeted proteolysis ends the coordinated interaction among and differentiation factors (GF)— platelet derived GF, fibro- 

the cyclin, kinase, and phosphatase molecules. blast GF, hepatocyte GF, insulin and insulin-like GFs, nerve 
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GF, vascular endothelial GF, macrophage colony stimulating 
factor, etc. Each protein phosphorylates itself by receptor 
dimerization to initiate the intracellular signalling cascade. 

Many protein-tyrosine kinases lack transmembrane
regions and form a complex with the intracellular regions of
other cell surface receptors. The best known non-receptor
PTKs (NR-PTKs) are the Src kinase family (Src, Yes, Fgr,
Fyn, Lck, Lyn, Hck, Blk, etc) and the Janus kinase family
(Jak1, Jak2, Jak3, Tyk2, etc). The Src PTKs are located on
the cytoplasmic side of the plasma membrane and are
characterized by Src homology regions 2 and 3 (SH2 and
SH3). Through these regions, Src PTKs recognize short
peptide motifs bearing phosphotyrosine or proline residues,
respectively, and mediate protein-protein interactions that
regulate a whole range of intracellular signalling molecules.
Janus PTKs contain PTK or PTK-like
domains and interact with growth hormone, prolactin, and
some of the same cytokine receptors as Src PTKs. The
cytokine receptors are unique both in their ability to recruit
multiple PTKs and in the diversity of their intracellular
domains which allow flexibility in their responses within
different cell types (Taniguchi T. (1995) Science
268:251-55). Src and Jak kinases were first identified as the
products of mutant oncogenes in cancer cells where their
activation was no longer subject to normal cellular controls.

Extracellular signalling proteins such as transforming
growth factor-β (TGF-β), activins, bone morphogenetic
protein, and related members of the TGF-β superfamily
interact with receptor serine/threonine kinases. Like EGF
above, these receptor kinases have a single pass transmembrane
domain with a serine/threonine kinase domain on the
cytosolic side of the plasma membrane. The signalling
pathways which are activated by binding the extracellular
signalling molecules are presently under investigation.

Mitogen-activated protein (MAP) kinases also regulate
intracellular signalling pathways. They mediate signal
transduction from cell surface to nuclei via phosphorylation
cascades. Several subgroups have been identified, and each
manifests different substrate specificities and responds to
distinct extracellular stimuli (Egan S. E. and Weinberg R. A.
(1993) Nature 365:781-783).

MAP kinase signalling pathways are present in mammalian
cells as well as in yeast. The extracellular stimuli which
activate mammalian pathways include epidermal growth
factor (EGF), ultraviolet light, hyperosmolar medium, heat
shock, endotoxic lipopolysaccharide (LPS), and
pro-inflammatory cytokines such as tumor necrosis factor (TNF)
and interleukin-1 (IL-1). In Saccharomyces cerevisiae,
exposure to mating pheromone or hyperosmolar environments
activates the various MAP kinase signalling pathways.

Mammalian cells have at least three subgroups of MAP
kinases (Derijard B. et al (1995) Science 267:682-5), each
distinguished by a tripeptide motif. They are extracellular
signal-regulated protein kinases (ERK) characterized by
Thr-Glu-Tyr; c-Jun amino-terminal kinases (JNK)
characterized by Thr-Pro-Tyr; and p38 kinase characterized by
Thr-Gly-Tyr. Each subgroup is activated by dual
phosphorylation of threonine and tyrosine residues by MAP kinase
kinases located upstream of the phosphorylation cascade.
Activated MAP kinases, in turn, phosphorylate downstream
effectors ultimately leading to intracellular changes.
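
The tripeptide motifs named above (Thr-Glu-Tyr, Thr-Pro-Tyr, Thr-Gly-Tyr) can be written in one-letter code as TEY, TPY and TGY. The following is only a minimal sketch, assuming a plain substring search over a protein sequence; the function name and the toy peptides are hypothetical, and a bare tripeptide can occur by chance, so a real assignment would also depend on alignment of the kinase domain.

```python
# One-letter forms of the tripeptide motifs named in the text.
MAPK_MOTIFS = {"TEY": "ERK", "TPY": "JNK", "TGY": "p38"}


def classify_mapk(protein_seq):
    """Return the sorted subgroup names whose motif occurs in the sequence.

    Illustrative only: this is a plain substring test, not a domain alignment.
    """
    seq = protein_seq.upper()
    return sorted({group for motif, group in MAPK_MOTIFS.items() if motif in seq})


if __name__ == "__main__":
    # Hypothetical toy peptides, not sequences from the specification.
    for peptide in ("AAFTEYVATRW", "MMTPYVVTRY", "EMTGYVATRW", "AAAAAA"):
        print(peptide, classify_mapk(peptide))
```

An empty list means none of the three motifs was found.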

The ERK signal transduction pathway is activated via
tyrosine kinase receptors on the plasmalemma. When
growth factors bind to the tyrosine kinase receptors, the
receptors bind to noncatalytic,
Src homology (SH) adaptor proteins (SH2-SH3-SH2) and a
guanine nucleotide releasing protein (GNRP). GNRP
reduces GTP and activates Ras proteins, members of the
large family of guanine nucleotide binding proteins
(G-proteins). Activated Ras proteins bind to a protein kinase
C-Raf-1 and activate the Raf-1 proteins. The activated Raf-1
kinase subsequently phosphorylates MAP kinase kinase
(MKK) which, in turn, activates ERKs.

ERKs are proline-directed protein kinases which
phosphorylate Ser/Thr-Pro motifs. In fact, cytoplasmic
phospholipase A2 (cPLA2) and transcription factor Elk-1 are
substrates of ERKs. The ERKs phosphorylate Ser 305 of cPLA2
thereby increasing its enzymatic activity and resulting in 
release of arachidonic acid and the formation of lysophos- 
pholipids from membrane phospholipids. Likewise, phos- 
phorylation of the transcription factor Elk-1 by ERK ulti- 
mately increases transcriptional activity. 
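
The proline-directed specificity described above can be illustrated by scanning a sequence for Ser-Pro or Thr-Pro pairs. This is a minimal sketch under the assumption that the minimal motif alone is of interest; the function name and the example peptide are hypothetical, and the presence of such a pair does not by itself show that a residue is phosphorylated.

```python
def find_sp_tp_sites(protein_seq):
    """Return 1-based positions of Ser/Thr residues immediately followed by Pro."""
    seq = protein_seq.upper()
    return [
        i + 1
        for i in range(len(seq) - 1)
        if seq[i] in "ST" and seq[i + 1] == "P"
    ]


if __name__ == "__main__":
    # Hypothetical peptide: Ser at position 3 and Thr at position 6 precede Pro.
    print(find_sp_tp_sites("MASPLTPKS"))
```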

JNK is distantly related to the ERK and is similarly 
activated by dual phosphorylation of Thr and Tyr and by 
MKK4 (Davis R (1994) TIBS 19:470-473). The JNK signal 
transduction pathway is also initiated by ultraviolet light, 
osmotic stress, and the pro-inflammatory cytokines, TNF 
and IL-1. Phosphorylation of Ser63 and Ser73 in the NH2-
terminal domain of the transcription factor c-Jun increases
transcriptional activity. 

p38 is a 41 kD protein containing 360 amino acids. Its
dual phosphorylation is activated by the MKK3 and MKK4, 
heat shock, hyperosmolar medium, IL-1 or LPS endotoxin 
(Han J. et al (1994) Science 265:808-811). Sepsis produced 
by LPS is characterized by fever, chills, tachypnea, and 
tachycardia, and severe cases may result in septic shock 
which includes hypotension and multiple organ failure. 

Cells respond to LPS as a stress signal because it alters 
normal cellular processes and induces the release of sys- 
temic mediators such as TNF. CD14 is a 
glycosylphosphatidyl-inositol-anchored membrane
glycoprotein which serves as an LPS receptor on the plasmalemma
of monocytic cells. The binding of LPS to CD14 causes
rapid protein tyrosine phosphorylation of the 44- and
42-/40-kD isoforms of MAP kinases. Although they bind LPS,
these MAP kinase isoforms do not appear to belong to the 
p38 subgroup. 

A detailed understanding of kinase pathways and signal
transduction is beginning to reveal some mechanisms for 
interceding in the progression of inflammatory illnesses and 
of uncontrolled cell proliferation. The cDNAs, 
oligonucleotides, peptides and antibodies for the human 
kinases, which are the subject of this invention and are listed 
in Table 1, provide a plurality of tools for studying signalling 
cascades in various cells and tissues and for diagnosing and 
selecting inhibitors or drugs with the potential to intervene 
in various disorders or diseases in which altered kinase 
expression is implicated. The disorders or diseases include,
but are not limited to, human X-linked agammaglobulinemia,
nonspherocytic hemolytic anemia, atherosclerosis, carcinomas
(breast, ovary, renal, squamous cell and prostate),
diabetes, gliomas, glomerular disease, hepatomegaly,
Kaposi's sarcoma, lymphoblastic and myelogenous leukemias,
myoglobinuria, peptic ulcer disease, psoriasis, pulmonary 
fibrosis, restenosis, and septic shock due to cholera, 
Clostridium difficile, E. coli and Shigella (Isselbacher K. J. 
et al (1994) Harrison's Principles of Internal Medicine, 
McGraw-Hill, New York City; Levitzki A. and A. Gazit 
(1995) Science 267:1782-88). 

SUMMARY OF THE INVENTION 

The subject invention provides unique polynucleotides 
(SEQ ID NOs 1-44) which have been identified as novel 
human kinases (kin). These partial cDNAs were identified 
among the polynucleotides which comprise various Incyte
cDNA libraries. 

The invention comprises polynucleotides which are
complementary to the kin sequences (SEQ ID NOs 1-44).

The invention also comprises the use of kin sequences to
identify and obtain full length human kinase cDNAs such
as SEQ ID NO 45.

The invention further comprises the use of oligomers 
from these kin sequences in a kinases kit which can be used 
to identify a disorder or disease with altered kinase expres- 
sion and provide a method for monitoring progress of a 
patient during drug therapy. 

Aspects of the invention include use of kin sequences or 
recombinant nucleic acids derived from them to produce 
purified peptides. Still further aspects of the invention use 
these purified peptides to identify antibodies or other mol- 
ecules with inhibitory activity toward a particular kinase, 
group of kinases or disease. 

In addition, the invention comprises the use of kin specific 
antibodies in assays to identify a disorder or disease with 
altered kinase expression and provides a method to monitor 
the progress of a patient during drug therapy. 

DESCRIPTION OF THE FIGURE 

FIGS. 1A and 1B display the full length nucleotide
sequence for human MAP kinase from stomach tissue (SEQ 
ID NO 45; Incyte Clone 214915E) and its predicted amino 
acid sequence. 

DETAILED DESCRIPTION OF THE 
INVENTION 

Definitions 

As used herein, the abbreviation for kinase in lower case 
(kin) refers to a gene, cDNA, RNA or nucleic acid sequence
while the upper case version (KIN) refers to a protein, 
polypeptide, peptide, oligopeptide, or amino acid sequence. 

An "oligonucleotide" or "oligomer" is a stretch of nucle- 
otide residues which has a sufficient number of bases to be 
used in a polymerase chain reaction (PCR). These short 
sequences are based on (or designed from) genomic or 
cDNA sequences and are used to amplify, confirm, or reveal 
the presence of an identical, similar or complementary DNA 
or RNA in a particular cell or tissue. Oligonucleotides or 
oligomers comprise portions of a DNA sequence having at 
least about 10 nucleotides and as many as about 50 
nucleotides, preferably about 15 to 30 nucleotides. They are 
chemically synthesized and may be used as probes.
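
The length ranges in the definition above, and the notion of a complementary sequence, can be sketched as follows. This is a minimal illustration only: the text says "about" 10 to 50 and "about" 15 to 30 nucleotides, so the hard cut-offs below are an assumption, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def oligo_length_class(dna):
    """Classify an oligomer by length, using the ranges in the definition.

    Assumption: the approximate ranges are treated as exact bounds.
    """
    n = len(dna)
    if n < 10 or n > 50:
        return "outside range"
    if 15 <= n <= 30:
        return "preferred"
    return "acceptable"


if __name__ == "__main__":
    oligo = "ATGGCCATTGTAATGGGCCG"  # hypothetical 20-mer
    print(oligo_length_class(oligo), reverse_complement(oligo))
```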

"Probes" are nucleic acid sequences of variable length, 
preferably between at least about 10 and as many as about 
6,000 nucleotides, depending on use. They are used in the 
detection of identical, similar, or complementary nucleic 
acid sequences. Longer length probes are usually obtained 
from a natural or recombinant source, are highly specific and 
much slower to hybridize than oligomers. They may be
single- or double-stranded and are carefully designed to have
specificity in PCR, membrane-based hybridization, or
ELISA-like technologies.

"Reporter" molecules are chemical moieties used for 
labelling a nucleic or amino acid sequence. They include, 
but are not limited to, radionuclides, enzymes, fluorescent, 
chemi-luminescent, or chromogenic agents. Reporter mol- 
ecules associate with, establish the presence of, and may 
allow quantification of a particular nucleic or amino acid 
sequence. 

A "portion" or "fragment" of a polynucleotide or nucleic 
acid comprises all or any part of the nucleotide sequence 
having fewer nucleotides than about 6 kb, preferably fewer
than about 1 kb which can be used as a probe. Such probes
may be labelled with reporter molecules using nick
translation, Klenow fill-in reaction, PCR or other methods
well known in the art. After pretesting to optimize reaction
conditions and to eliminate false positives, nucleic acid
probes may be used in Southern, northern or in situ
hybridizations to determine whether DNA or RNA encoding the
protein is present in a biological sample, cell type, tissue,
organ or organism.

"Recombinant nucleotide variants" are polynucleotides 
which encode a protein. They may be synthesized by making 
use of the "redundancy" in the genetic code. Various codon 
substitutions, such as the silent changes which produce 
specific restriction sites or codon usage-specific mutations,
may be introduced to optimize cloning into a plasmid or
viral vector or expression in a particular prokaryotic or
eukaryotic host system, respectively.

"Linkers" are synthesized palindromic nucleotide
sequences which create internal restriction endonuclease
sites for ease of cloning the genetic material of choice into
various vectors. "Polylinkers" are engineered to include
multiple restriction enzyme sites and provide for the use of
both those enzymes which leave 5' and 3' overhangs such as
BamHI, EcoRI, PstI, KpnI and HindIII or which provide a
blunt end such as EcoRV, SnaBI and StuI.
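
The palindromic character of linker sites can be checked directly: a site is palindromic in this sense when it equals its own reverse complement. The sketch below uses the commonly cited recognition sequences for the enzymes named above; the function names and the example polylinker are hypothetical and are not taken from the specification.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Recognition sequence and type of end produced, for the enzymes named in the text.
RECOGNITION_SITES = {
    "BamHI": ("GGATCC", "5' overhang"),
    "EcoRI": ("GAATTC", "5' overhang"),
    "HindIII": ("AAGCTT", "5' overhang"),
    "PstI": ("CTGCAG", "3' overhang"),
    "KpnI": ("GGTACC", "3' overhang"),
    "EcoRV": ("GATATC", "blunt"),
    "SnaBI": ("TACGTA", "blunt"),
    "StuI": ("AGGCCT", "blunt"),
}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def is_palindromic(site):
    """True when the site reads the same on both strands (5' to 3')."""
    return site.upper() == reverse_complement(site)


def find_sites(dna):
    """Map each enzyme to the 0-based start positions of its site in dna."""
    seq = dna.upper()
    hits = {}
    for enzyme, (site, _end_type) in RECOGNITION_SITES.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start)
            start = seq.find(site, start + 1)
        if positions:
            hits[enzyme] = positions
    return hits


if __name__ == "__main__":
    polylinker = "GAATTCGGATCCAAGCTT"  # hypothetical EcoRI-BamHI-HindIII polylinker
    print(find_sites(polylinker))
    print(all(is_palindromic(site) for site, _ in RECOGNITION_SITES.values()))
```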

"Control elements" or "regulatory sequences" are those 
nontranslated regions of the gene or DNA such as enhancers, 
promoters, introns and 3' untranslated regions which interact 
with cellular proteins to carry out replication, transcription,
and translation. They may occur as boundary sequences or
even split the gene. They function at the molecular level and
along with regulatory genes are very important in
development, growth, differentiation and aging processes.

"Chimeric" molecules are polynucleotides or polypeptides
which are created by combining one or more of the
nucleotide sequences of this invention (or their parts) with
additional nucleic acid sequence(s). Such combined
sequences may be introduced into an appropriate vector and
expressed to give rise to a chimeric polypeptide which may
be expected to be different from the native molecule in one
or more of the following kinase characteristics: cellular
location, distribution, ligand-binding affinities, interchain
affinities, degradation/turnover rate, signalling, etc.

"Active" is that state which is capable of being useful or
of carrying out some role. It specifically refers to those
forms, fragments, or domains of an amino acid sequence
which display the biologic and/or immunogenic activity
characteristic of the naturally occurring kinase.

"Naturally occurring KIN" refers to a polypeptide
produced by cells which have not been genetically engineered
or which have been genetically engineered to produce the
same sequence as that naturally produced. Specifically
contemplated are various polypeptides which arise from
post-translational modifications. Such modifications of the
polypeptide include but are not limited to acetylation,
carboxylation, glycosylation, phosphorylation, lipidation
and acylation.

"Derivative" refers to those polypeptides which have been
chemically modified by such techniques as ubiquitination,
labelling (see above), pegylation (derivatization with
polyethylene glycol), and chemical insertion or substitution of
amino acids such as ornithine which do not normally occur
in human proteins.

"Recombinant polypeptide variant" refers to any polypeptide
which differs from naturally occurring KIN by amino
acid insertions, deletions and/or substitutions, created using
recombinant DNA techniques. Guidance in determining
which amino acid residues may be replaced, added or
deleted without abolishing characteristics of interest may be
found by comparing the sequence of KIN with that of related
polypeptides and minimizing the number of amino acid
sequence changes made in highly conserved regions.

Amino acid "substitutions" are defined as one for one
amino acid replacements. They are conservative in nature
when the substituted amino acid has similar structural and/or
chemical properties. Examples of conservative replacements
are substitution of a leucine with an isoleucine or valine, an
aspartate with a glutamate, or a threonine with a serine.

Amino acid "insertions" or "deletions" are changes to or
within an amino acid sequence. They typically fall in the
range of about 1 to 5 amino acids. The variation allowed in
a particular amino acid sequence may be experimentally
determined by producing the peptide synthetically or by
systematically making insertions, deletions, or substitutions
of nucleotides in the kin sequence using recombinant DNA
techniques.

A "signal or leader sequence" is a short amino acid
sequence which can be used, when desired, to direct the
polypeptide through a membrane of a cell. Such a sequence
may be naturally present on the polypeptides of the present
invention or provided from heterologous sources by
recombinant DNA techniques.

An "oligopeptide" is a short stretch of amino acid residues
and may be expressed from an oligonucleotide. It may be
functionally equivalent to and either the same length as or
considerably shorter than a "fragment," "portion," or
"segment" of a polypeptide. Such sequences comprise a
stretch of amino acid residues of at least about 5 amino acids
and often about 17 or more amino acids, typically at least
about 9 to 13 amino acids, and of sufficient length to display
biologic and/or immunogenic activity.

An "inhibitor" is a substance which retards or prevents a
chemical or physiological reaction or response. Common
inhibitors include but are not limited to antisense molecules,
antibodies, antagonists and their derivatives.

A "standard" is a quantitative or qualitative measurement
for comparison. Preferably, it is based on a statistically
appropriate number of samples and is created to use as a
basis of comparison when performing diagnostic assays,
running clinical trials, or following patient treatment
profiles. The samples of a particular standard may be normal or
similarly abnormal.

"Animal" as used herein may be defined to include
human, domestic (cats, dogs, etc), agricultural (cows,
horses, sheep, goats, chicken, fish, etc) or test species (frogs,
mice, rats, rabbits, simians, etc).

"Disorders or diseases" in which altered kinase activity
has been implicated specifically include, but are not limited
to, human X-linked agammaglobulinemia, nonspherocytic
hemolytic anemia, atherosclerosis, carcinomas (breast,
ovary, renal, squamous cell and prostate), diabetes, gliomas,
glomerular disease, hepatomegaly, Kaposi's sarcoma,
lymphoblastic and myelogenous leukemias, myoglobinuria,
peptic ulcer disease, psoriasis, pulmonary fibrosis,
restenosis, and septic shock due to cholera, Clostridium
difficile, E. coli and Shigella.

Since the list of technical and scientific terms cannot be all
encompassing, any undefined terms shall be construed to
have the same meaning as is commonly understood by one
of skill in the art to which this invention belongs.
Furthermore, the singular forms "a", "an" and "the" include
plural referents unless the context clearly dictates otherwise.
For example, reference to a "restriction enzyme" or a "high
fidelity enzyme" may include mixtures of such enzymes and
any other enzymes fitting the stated criteria, or reference to
the method includes reference to one or more methods for
obtaining cDNA sequences which will be known to those
skilled in the art or will become known to them upon reading
this specification.

Before the present sequences, variants, formulations and
methods for making and using the invention are described,
it is to be understood that the invention is not to be limited
only to the particular sequences, variants, formulations or
methods described. The sequences, variants, formulations
and methodologies may vary, and the terminology used
herein is for the purpose of describing particular
embodiments. The terminology and definitions are not intended to
be limiting since the scope of protection will ultimately
depend upon the claims.

DESCRIPTION OF THE INVENTION

The present invention provides for purified partial protein
kinase cDNAs which were expressed in various human
tissues and isolated therefrom. These sequences were
identified by their similarity to published or known open reading
frames or untranslated control regions. Since protein kinases
are associated with basic cellular processes such as cell
proliferation, differentiation and cell signalling, these
nucleotide sequences are useful in the characterization of and
delineation of normal and abnormal processes. Kinase
nucleotide sequences are useful in diagnostic assays used to
evaluate the role of a specific kinase in normal, diseased, or
therapeutically treated cells.

Purified kinase nucleotide sequences have numerous
applications in techniques known to those skilled in the art
of molecular biology. These techniques include their use as
hybridization probes, for chromosome and gene mapping, in
PCR technologies, in the production of sense or antisense
nucleic acids, in screening for new therapeutic molecules,
etc. These examples are well known and are not intended to
be limiting. Furthermore, the nucleotide sequences disclosed
herein may be used in molecular biology techniques that
have not yet been developed, provided the new techniques
rely on properties of nucleotide sequences that are currently 
known, including but not limited to such properties as the 
triplet genetic code and specific base pair interactions. 

As a result of the degeneracy of the genetic code, a 
multitude of kinase-encoding nucleotide sequences may be 
produced and some of these will bear only minimal homol- 
ogy to the endogenous sequence of any known and naturally 
occurring kinase. This invention has specifically contem- 
plated each and every possible variation of nucleotide 
sequence that could be made by selecting combinations 
based on possible codon choices. These combinations are 
made in accordance with the standard triplet genetic code as 
applied to the nucleotide sequence of naturally occurring 
kinases, and all such variations are to be considered as being 
specifically disclosed. 
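
To give a sense of how large the "multitude" of encoding sequences is, the number of distinct nucleotide sequences that encode a given peptide is the product of the number of codons available for each residue under the standard genetic code. The sketch below is illustrative only; the function name is hypothetical and stop codons are ignored.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(peptide):
    """Return how many distinct coding sequences encode the peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide.upper())


if __name__ == "__main__":
    # Met (1 codon) x Lys (2) x Leu (6) = 12 possible coding sequences.
    print(count_encodings("MKL"))
```

Because the count is a product over residues, it grows very quickly with peptide length.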

Although the kinase nucleotide sequences and their 
derivatives or variants are preferably capable of identifying 
the nucleotide sequence of the naturally occurring kinase 
under optimized conditions, it may be advantageous to 
produce kinase-encoding nucleotide sequences possessing a 
substantially different codon usage. Codons can be selected 
to increase the rate at which expression of the peptide occurs 
in a particular prokaryotic or eukaryotic expression host in 
accordance with the frequency with which particular codons 
are utilized by the host. Other reasons for substantially 
altering the nucleotide sequence encoding the kinase without 
altering the encoded amino acid sequence include the
production of RNA transcripts having more desirable
properties, such as a longer half-life, than transcripts pro- 
duced from the naturally occurring sequence. 
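
Selecting codons according to the frequency with which a host uses them can be sketched as a simple back-translation that picks the most frequent codon for each residue. The usage table below is a hypothetical placeholder for an unnamed host, with illustrative numbers rather than measured values, and the function name is likewise an assumption.

```python
# Hypothetical relative codon frequencies for an unnamed expression host.
# The numbers are placeholders for illustration, not measured values.
HYPOTHETICAL_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.70, "AAG": 0.30},
    "L": {"CTG": 0.50, "TTA": 0.15, "TTG": 0.15,
          "CTT": 0.10, "CTC": 0.05, "CTA": 0.05},
}


def back_translate(peptide, usage):
    """Encode a peptide using the most frequent codon for each residue."""
    codons = []
    for aa in peptide.upper():
        table = usage[aa]  # raises KeyError for residues missing from the table
        codons.append(max(table, key=table.get))
    return "".join(codons)


if __name__ == "__main__":
    print(back_translate("MKL", HYPOTHETICAL_USAGE))
```

Always taking the single most frequent codon is the simplest possible rule; it changes the nucleotide sequence without changing the encoded amino acid sequence.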

Nucleotide sequences encoding a kinase may be joined to
a variety of other nucleotide sequences by means of well
established recombinant DNA techniques (Sambrook J. et al
(1989) Molecular Cloning: A Laboratory Manual, Cold
Spring Harbor Laboratory, Cold Spring Harbor, N.Y.; or
Ausubel F. M. et al Current Protocols in Molecular
Biology, John Wiley & Sons, New York City). Useful
sequences for joining to the kinase include an assortment of
cloning vectors such as plasmids, cosmids, lambda phage
derivatives, phagemids, and the like. Vectors of interest
include vectors for replication, expression, probe generation,
sequencing, and the like. In general, vectors of interest may
contain an origin of replication functional in at least one
organism, convenient restriction endonuclease sensitive
sites, and selectable markers for one or more host cell
systems.

PCR as described in U.S. Pat. Nos. 4,683,195; 4,800,195;
and 4,965,188 provides additional uses for oligonucleotides
based upon the kinase nucleotide sequence. Such oligomers
are generally chemically synthesized, but they may be of
recombinant origin or a mixture of both. Oligomers
generally comprise two nucleotide sequences, one with sense
orientation (5'→3') and one with antisense (3' to 5')
employed under optimized conditions for identification of a
specific gene or diagnostic use. The same two oligomers,
nested sets of oligomers, or even a degenerate pool of
oligomers may be employed under less stringent conditions
for identification and/or quantitation of closely related DNA
or RNA sequences.

Full length genes may be cloned utilizing partial
nucleotide sequence and various methods known in the art.
Gobinda et al (1993; PCR Methods Applic 2:318-22)
disclose "restriction-site PCR" as a direct method which uses
universal primers to retrieve unknown sequence adjacent to
a known locus. First, genomic DNA is amplified in the
presence of a primer to the linker and a primer specific to the
known region. The amplified sequences are subjected to a
second round of PCR with the same linker primer and
another specific primer internal to the first one. Products of
each round of PCR are transcribed with an appropriate RNA
polymerase and sequenced using reverse transcriptase.
Gobinda et al present data concerning Factor IX for which
they identified a conserved stretch of 20 nucleotides in the
3' noncoding region of the gene.

Inverse PCR is the first method to report successful
acquisition of unknown sequences starting with primers
based on a known region (Triglia T. et al (1988) Nucleic
Acids Res 16:8186). The method uses several restriction
enzymes to generate a suitable fragment in the known region
of a gene. The fragment is then circularized by intramolecular
ligation and used as a PCR template. Divergent primers
are designed from the known region. The multiple rounds of
restriction enzyme digestions and ligations that are
necessary prior to PCR make the procedure slow and expensive
(Gobinda et al, supra).

Capture PCR (Lagerstrom M. et al (1991) PCR Methods
Applic 1:111-19) is a method for PCR amplification of DNA
fragments adjacent to a known sequence in human and YAC
DNA. As noted by Gobinda et al (supra), capture PCR also
requires multiple restriction enzyme digestions and ligations
to place an engineered double-stranded sequence into an
unknown portion of the DNA molecule before PCR.
Although the restriction and ligation reactions are carried
out simultaneously, the requirements for extension,
immobilization and two rounds of PCR and purification prior to
sequencing render the method cumbersome and time
consuming.

Parker J. D. et al (1991; Nucleic Acids Res 19:3055-60),
teach walking PCR, a method for targeted gene walking
which permits retrieval of unknown sequence.
PromoterFinder™ is a new kit available from Clontech (Palo Alto,
Calif.) which uses PCR and primers derived from p53 to
walk in genomic DNA. Nested primers and special
PromoterFinder libraries are used to detect upstream sequences
such as promoters and regulatory elements. This process
avoids the need to screen libraries and is useful in finding
intron/exon junctions.

Another new PCR method, "Improved Method for
Obtaining Full Length cDNA Sequences" by Guegler et al,
patent application Ser. No 08/487,112, filed Jun. 7, 1995 and
hereby incorporated by reference, employs XL-PCR
(Perkin-Elmer, Foster City, Calif.) to amplify and extend
partial nucleotide sequence into longer pieces of DNA. This
method was developed to allow a single researcher to
process multiple genes (up to 20 or more) at one time and to
obtain an extended (possibly full-length) sequence within
6-10 days. This new method replaces methods which use
labelled probes to screen plasmid libraries and allow one
researcher to process only about 3-5 genes in 14-40 days.

In the first step, which can be performed in about two
days, any two of a plurality of primers are designed and
synthesized based on a known partial sequence. In step 2,
which takes about six to eight hours, the sequence is
extended by PCR amplification of a selected library. Steps 3
and 4, which take about one day, are purification of the
amplified cDNA and its ligation into an appropriate vector.
Step 5, which takes about one day, involves transforming
and growing up host bacteria. In step 6, which takes
approximately five hours, PCR is used to screen bacterial clones for
extended sequence. The final steps, which take about one
day, involve the preparation and sequencing of selected
clones.

If the full length cDNA has not been obtained, the entire
procedure is repeated using either the original library or
some other preferred library. The preferred library may be
one that has been size-selected to include only larger cDNAs
or may consist of single or combined commercially
available libraries, eg. lung, liver, heart and brain from Gibco/
BRL (Gaithersburg, Md.). The cDNA library may have been
prepared with oligo (dT) or random priming. Random
primed libraries are preferred in that they will contain more
sequences which contain 5' ends of genes. A randomly
primed library may be particularly useful if an oligo (dT)
library does not yield a complete gene. It must be noted that
the larger and more complex the protein, the less likely it is
that the complete gene will be found in a single plasmid.

A new method for analyzing either the size or the
nucleotide sequence of PCR products is capillary electrophoresis.
Systems for rapid sequencing are available from Perkin
Elmer (Foster City, Calif.), Beckman Instruments (Fullerton,
Calif.), and other companies. Capillary sequencing employs
flowable polymers for electrophoretic separation, four
different fluorescent dyes (one for each nucleotide) which are
laser activated, and detection of the emitted wavelengths by
a charge coupled device camera. Output/light intensity is
converted to electrical signal using appropriate software (eg.
Genotyper™ and Sequence Navigator™ from Perkin
Elmer) and the entire process from loading of samples to
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computer analysis and electronic data display is computer particular therapeutic treatment regimejt * 
controlled. Capillary electrophoresis provides greater reso- an.mal stud.es, >n cl.n.cal tnals, or m 
lution and is many times faster than standard gel based ment of an individual pahent. F^^« ^ 
procedures. It is p^ticularly suited to the sequencing of must be established for use as a b as* o [ 
small pieces of DNA which might be present in limited s Second, samples from the ammals o P'^jf^gStJ* 
amounte in a particular sample. The reproducible sequenc- disorder or d^ase are combined with the nucleotide 
ine of up to 350 bp of M13 phage DNA in 30 min has been sequence to evaluate the deviation from the standard or 
^rd^u^rtinez M. C et al (1993) Anal Chem normal profile Tnird an «^.«£^ C ^"L» 
65-2851^ administered, and a treatment profile is generated. The assay 
Another aspect of the subject invention is to provide for 10 is evaluated to determine whether the P^^" 
kinase hybridization probes which are capable of hybridiz- toward or returns to the standard pattern. Successive treat- 
ing with naturally occurring nucleotide sequences encoding ment profiles may be used to show the efficacy of treatment 
kinases. The stringency of the hybridization conditions will over a period of several days or several months, 
determine whether the probe identifies only the native xh e nucleotide sequence for any particular kinase (SEQ 
nucleotide sequence of that specific kinase or sequences of jd f^rjs 1^5) can also be used to generate probes for 
closely related molecules. If degenerate kinase nucleotide mapping the native genomic sequence. The sequence may be 
sequences of the subject invention are used for the detection ma pped to a particular chromosome or to a specific region 
of related kinase encoding sequences, they should preferably of me c h romos0 me using well known techniques. These 
contain at least 50% of the nucleotides of the sequences i nc \ n dc in situ hybridization to chromosomal spreads 
presented herein. Hybridization probes of the subject inven- n/erma et al (1988) Human Chromosomes: A Manual of 
tion may be derived from the nucleotide sequences of the 20 ^ Techni Perga mon Press, New York City), flow- 
SEQ ID NOs 1-44, or from surrounding or included chromosomal preparations, or artificial chromosome 
genomic sequences comprising undated «£™ s ^ constructions such as yeast artificial chromosomes (YACs), 
promoters enhancers and introns. Such hybridization probes chromosomes (BACs), bacterial PI con- 

may be labelled with appropriate reporter molecules. Means oacienai diunuai wmu V nM A / ;Km . c 

for producing specific hybridization probes for kinases 25 strucUons or single chromosome cDNA libraries, 
include oligolabeUing, nick translation, end-labelling or In situ hybridization of chromosomal preparations and 
PCR amplification using a labelled nucleotide. Alternatively, physical mapping techniques such as linkage analysis using 
the cDNA sequence may be cloned into a vector for the established chromosomal markers are invaluable in extend- 
production of mRNA probe. Such vectors are known in the ing genetic maps. Examples of genetic maps can be found in 
art, are commercially available, and may be used to synthe- 30 the 1994 Genome Issue of Science (265:1981f). Often the 
size RNA probes invitro by addition of an appropriate RNA placement of a gene on the chromosome of another mam- 
polymerase such as T7, T3 or SP6 and labelled nucleotides. malian species may reveal associated markers even if the 
A number of companies (such as Pharmacia Biotech, number or arm of a particular human chromosome is not 
Piscataway, N J.; Promega, Madison, Wis.; US Biochemical known. New partial nucleotide sequences can be assigned to 
Corp Cleveland, Ohio; etc.) supply commercial kits and 35 chromosomal arms, or parts thereof, by physical mapping, 
protocols for these procedures. This provides valuable information to investigators search- 
It is also possible to produce a DNAsequence, or portions ing for disease genes using positional cloning or other gene 
thereof, entirely by synthetic chemistry. Sometimes the discovery techniques Once a disease or syndrome such as 
source of information for producing this sequence comes ataxia telangiectasia (AI), has been crudely localized by 
from the known homologous sequence from closely related 40 genetic linkage to a particular J^^^^^^ 
organisms. After synthesis, the nucleic acid sequence can be AT to llq22-23 (Gatti et al (1988) Nature 336:577-580), 
used alone or joined with a preexisting sequence and any sequences mapping to that area may represent genes for 
inserted into one of the many available DNA vectors and further investigation. The nucleotide sequences of the sub- 
their respective host cells using techniques well known in ject invention may also be used to detect differences in the 
the art. Moreover, synthetic chemistry may be used to 45 chromosomal location of nucleotide sequences due to 
introduce specific mutations into the nucleotide sequence. translocation, inversion, etc. between normal and earner or 
Alternatively, a portion of sequence in which a mutation is affected individuals. 

desired can be synthesized and recombined with a portion of The partial nucleotide sequence encoding a particular 

an existing genomic or recombinant sequence. kinase may be used to produce an amino acid sequence using 

Hie kinase nucleotide sequences can be used individually, 50 well known methods of recombinant DNA technology, 

or in panels, in a diagnostic test or assay to detect disorder Goeddel (1990, Gene Expression Technology, Methods and 

or disease processes associated with abnormal levels of Enzymology, Vol 185, Academic Press, San Diego, Calif.) is 

kinase expression. The nucleotide sequence is added to a one among many publications which teach expression of an 

sample (fluid, cell or tissue) from a patient under hybridizing isolated, purified nucleotide sequence. The ammo acid or 

conditions. After an incubation period, the sample is washed 55 peptide may be expressed in a variety of host cefls, either 

with a compatible fluid which optionally contains a reporter prokaryotic or eukaryotic. Host cells may be from the same 

molecule which will bind the specific nucleotide. After the species from which the nucleotide sequence was derived or 

compatible fluid is rinsed off, the reporter molecule is from a different species .Advantages of Pacing an amino 

quantitated and compared with a standard for that fluid, cell acid sequence or peptide by recombmant DNA technology 

or tissue. If kinase expression is significantly different from 60 include obtaining adequate amounts for purification and the 

the standard, the assay indicates the presence of disorder or availability of simplified purification procedures, 

disease. The form of such qualitative or quantitative meth- Cells transformed with a kinase nucleotide sequence may 

ods may include northern analysis, dot blot or other mem- be cultured under conditions suitable for the expression and 

brane based technologies, dip stick, pin or chip technologies, recovery of peptide from cell culture. The peptide produced 

PCR EUSAs or other multiple sample format technologies. 65 by a recombinant cell may be secreted or may be contained 

This same assay, combining a sample with the nucleotide intraceUularly depending on the sequence itself and/or the 

sequence, is applicable in evaluating the efficacy of a vector used. In general, it is more convenient to prepare 
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recombinant proteins in secreted form, and this is accom- 
plished by ligating kin to a recombinant nucleotide sequence 
which directs its movement through a particular prokaryotic 
or eukaryotic cell membrane. Other recombinant construc- 
tions may join kin to nucleotide sequence encoding a 
polypeptide domain which will facilitate protein purification 
(Kroll D. J. et al (1993) DNA Cell Biol 12:441-53). 

Direct peptide synthesis using solid-phase techniques
(Stewart et al (1969) Solid-Phase Peptide Synthesis, WH
Freeman Co, San Francisco, Calif.; Merrifield J. (1963) J
Am Chem Soc 85:2149-2154) is an alternative to recombinant
or chimeric peptide production. Automated synthesis
may be achieved, for example, using Applied Biosystems
431A Peptide Synthesizer in accordance with the instructions
provided by the manufacturer. Additionally a particular
kinase sequence or any part thereof may be mutated during
direct synthesis and combined using chemical methods with
other kinase sequence(s) or a part thereof. This chimeric
nucleotide sequence can also be placed in an appropriate
vector and host cell to produce a variant peptide.

Although an amino acid sequence or oligopeptide used for
antibody induction does not require biological activity, it
must be immunogenic. KIN used to induce specific antibodies
may have an amino acid sequence consisting of at
least five amino acids and preferably at least 10 amino acids.
Short stretches of amino acid sequence may be fused with
those of another protein such as keyhole limpet hemocyanin,
and the chimeric peptide used for antibody production.
Alternatively, the oligopeptide may be of sufficient length to
contain an entire domain.

Antibodies specific for KIN may be produced by inoculation
of an appropriate animal with an antigenic fragment of
the peptide. An antibody is specific for KIN if it is produced
against an epitope of the polypeptide and binds to at least
part of the natural or recombinant protein. Antibody production
includes not only the stimulation of an immune
response by injection into animals, but also analogous
processes such as the production of synthetic antibodies, the
screening of recombinant immunoglobulin libraries for
specific-binding molecules (Orlandi R. et al (1989) PNAS
86:3833-3837, or Huse W. D. et al (1989) Science
256:1275-1281), or the in vitro stimulation of lymphocyte
populations. Current technology (Winter G. and Milstein C.
(1991) Nature 349:293-299) provides for a number of
highly specific binding reagents based on the principles of
antibody formation. These techniques may be adapted to
produce molecules which specifically bind kinase peptides.
Antibodies or other appropriate molecules generated against
a specific immunogenic peptide fragment or oligopeptide
can be used in Western analysis, enzyme-linked immunosorbent
assays (ELISA) or similar tests to establish the presence
of or to quantitate amounts of kinase active in normal,
diseased, or therapeutically treated cells or tissues.

The examples below are provided to illustrate the subject
invention. These examples are provided by way of illustration
and are not included for the purpose of limiting the
invention.

EXAMPLES
I cDNA Library Construction

The kinase sequences of this application (Table 1) were
first identified among the sequences comprising various
libraries. Technology has advanced considerably since the
first cDNA libraries were made. Many small variations in
both [illegible] over
time, and these have improved both the efficiency and safety
of the process. Although the cDNAs could be obtained using
an older procedure, the procedure presented in this application
is exemplary of one currently being used by persons
skilled in the art. For the purpose of providing an exemplary
method, the tissue preparation, mRNA isolation and cDNA
library construction described here is for the rheumatoid
synovium library from which the Incyte Clones 191283 and
192268 for ser/thr kinases were obtained.

Rheumatoid synovial tissue was obtained from the hip
joint removed from a 68 year old female with erosive,
nodular rheumatoid arthritis. The tissue was frozen, ground
to powder in a mortar and pestle, and lysed immediately in
buffer containing guanidinium isothiocyanate. The lysate
was centrifuged over a CsCl cushion (18 hrs at 25,000 rpm
using a Beckman SW28 rotor and ultracentrifuge; Beckman
Instruments, Palo Alto, Calif.), ethanol precipitated, resuspended
in water and DNase treated for 15 min at 37° C. The
RNA was extracted with phenol chloroform and precipitated
with ethanol. Polyadenylated messages were isolated using
Qiagen Oligotex (QIAGEN Inc, Chatsworth, Calif.), and a
custom cDNA library was constructed by Stratagene (La
Jolla, Calif.).

First strand cDNA synthesis was accomplished using an
oligo (dT) primer/linker which also contained an XhoI
restriction site. Second strand synthesis was performed
using a combination of DNA polymerase I, E. coli ligase and
[illegible], followed by the addition of an EcoRI linker to the
blunt ended cDNA. The EcoRI linked, double-stranded
cDNA was then digested with XhoI restriction enzyme,
extracted with phenol chloroform, and fractionated by size
on Sephacryl S400. DNA of the appropriate size was then
ligated to dephosphorylated Lambda Zap® arms
(Stratagene) and packaged using Gigapack extracts
(Stratagene). pBluescript (Stratagene) phagemid DNAs
were excised en masse from the library.

In the alternative, DNAs were purified using Miniprep
Kits (Catalog #77468; Advanced Genetic Technologies
Corporation, Gaithersburg, Md.). These kits provide a
96-well format and enough reagents for 960 purifications.
The recommended protocol supplied with each kit has been
employed except for the following changes. First, the 96
wells are each filled with only [illegible] ml of sterile Terrific broth
(LIFE TECHNOLOGIES™, Gaithersburg, Md.) with carbenicillin
at [illegible] mg/L (2xCarb) and glycerol at 0.4%. After the
wells are inoculated, the bacteria are cultured for 24 hours
and lysed with 60 µl of lysis buffer. A centrifugation step
(2900 rpm for 5 minutes) is performed before the contents
of the block are added to the primary filter plate. The
optional step of adding isopropanol to TRIS buffer is not
routinely performed. After the last step in the protocol,
samples are transferred to a Beckman 96-well block for
storage.

II Sequencing of cDNA Clones

The cDNA inserts from random isolates of the rheumatoid
synovium or other appropriate library were sequenced in
part. Methods for DNA sequencing are well known in the art
and employ such enzymes as the Klenow fragment of DNA
polymerase I, SEQUENASE® (US Biochemical Corp) or
Taq polymerase. Methods to extend the DNA from an
oligonucleotide primer annealed to the DNA template of
interest have been developed for both single- and double-stranded
templates. Chain termination reaction products
were separated using electrophoresis and detected via their
incorporated, labelled precursors. Recent improvements in
mechanized reaction preparation, sequencing and analysis
have permitted expansion in the number of sequences that
can be determined per day. Preferably the process is automated
with machines such as the Hamilton Micro Lab 2200
(Hamilton, Reno, Nev.), Peltier Thermal Cycler (PTC200;
MJ Research, Watertown, Mass.) and the Applied Biosystems
Catalyst 800 and 377 and 373 DNA sequencers.

The quality of any particular cDNA library may be 
determined by performing a pilot scale analysis of 192 
cDNAs and checking for percentages of clones containing 
vector, lambda or E. coli DNA, mitochondrial or repetitive 
DNA, and clones with exact or homologous matches to
public databases. The number of unique sequences (those
having no known match in any available database) was
recorded.

III Homology Searching of cDNA Clones and Their
Deduced Proteins 

Each sequence so obtained was compared to sequences in 
GenBank using a search algorithm developed by Applied 
Biosystems and incorporated into the INHERIT™ 670 
Sequence Analysis System. In this algorithm, Pattern Speci- 
fication Language (TRW Inc, Los Angeles, Calif.) was used 
to determine regions of homology. The three parameters that 
determine how the sequence comparisons run were window 
size, window offset, and error tolerance. Using a combina- 
tion of these three parameters, the DNA database was 
searched for sequences containing regions of homology to 
the query sequence, and the appropriate sequences were 
scored with an initial value. Subsequently, these homolo- 
gous regions were examined using dot matrix homology 
plots to distinguish regions of homology from chance 
matches. Smith-Waterman alignments were used to display 
the results of the homology search. 

Peptide and protein sequence homologies were ascer- 
tained using the INHERIT™ 670 Sequence Analysis System 
in a way similar to that used in DNA sequence homologies. 
Pattern Specification Language and parameter windows 
were used to search protein databases for sequences con- 
taining regions of homology which were scored with an 
initial value. Dot-matrix homology plots were examined to 
distinguish regions of significant homology from chance 
matches. 

Alternatively, BLAST, which stands for Basic Local 
Alignment Search Tool, is used to search for local sequence 
alignments (Altschul S. F. (1993) J Mol Evol 36:290-300; 
Altschul, S. F. et al (1990) J Mol Biol 215:403-10). BLAST 
produces alignments of both nucleotide and amino acid 
sequences to determine sequence similarity. Because of the 
local nature of the alignments, BLAST is especially useful 
in determining exact matches or in identifying homologs. 
While it is useful for matches which do not contain gaps, it 
is inappropriate for performing motif-style searching. The 
fundamental unit of BLAST algorithm output is the High- 
scoring Segment Pair (HSP). 

An HSP consists of two sequence fragments of arbitrary 
but equal lengths whose alignment is locally maximal and 
for which the alignment score meets or exceeds a threshold
or cutoff score set by the user. The BLAST approach is
to look for HSPs between a query sequence and a database 
sequence, to evaluate the statistical significance of any 
matches found, and to report only those matches which 
satisfy the user-selected threshold of significance. The 
parameter E establishes the statistically significant threshold 
for reporting database sequence matches. E is interpreted as 
the upper bound of the expected frequency of chance 
occurrence of an HSP (or set of HSPs) within the context of 
the entire database search. Any database sequence whose 
match satisfies E is reported in the program output. 
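
The idea of a locally maximal, ungapped segment pair that must clear a user-set cutoff can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it compares two equal-length fragments position by position, the scores of +5 per match and -4 per mismatch are assumed values chosen for the example, and the function names are hypothetical.

```python
def best_ungapped_segment(a, b, match=5, mismatch=-4):
    """Return (score, start, end) of the highest-scoring contiguous run
    when fragments a and b are compared position by position (no gaps).

    The half-open interval [start, end) indexes both fragments.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        run_score += match if x == y else mismatch
        if run_score <= 0:
            # A run that has dropped to zero cannot help a later segment.
            run_score, run_start = 0, i + 1
        elif run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


def meets_cutoff(score, cutoff):
    """Report a segment only if its score meets or exceeds the cutoff."""
    return score >= cutoff


score, start, end = best_ungapped_segment("TTACGTACGAA", "GGACGTACGCC")
print(score, start, end, meets_cutoff(score, cutoff=30))
```

Extending the segment in either direction would lower its score, which is the sense in which such an alignment is locally maximal. The statistical step (converting a score into the expectation E) is not modelled here.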

All the kinase molecules presented in this application
were examined using INHERIT. Although their identification
was based on the criteria above, their homology to
known kinase molecules and name are subject to change
when additional computer analysis against additional or 
more recent database information is employed. For example, 
whereas the first two kinases in Table 1 were initially 
identified as unique Incyte clones, homologous mouse and 
human kinases are now known. In other cases, additional 
sequence information has become available and its review 
against the known databases has precipitated a name change. 
Occasionally a clone number will also disappear from the 
LIFESEQ™ database (Incyte Pharmaceuticals Inc, Palo 
Alto, Calif.). This situation generally arises during the 
regular review of clones and assembly of contiguous 
sequences. 

IV Extension of cDNAs to Full Length 

The kinase sequences presented here can be used to 
design oligonucleotide primers for the extension of the 
cDNAs to full length. In fact, the partial map kinase cDNA 
sequence (SEQ ID NO 38) initially identified in Incyte clone 
214915 among the sequences comprising the human stom- 
ach cell library was extended to full length as shown in "A 
Novel Human Map Kinase Homolog" by Hawkins et al. 
Incyte Docket PF-036P, filed on Jun. 28, 1995, incorporated 
herein by reference. The coding region of this full length 
sequence (SEQ ID NO 45; Incyte Clone 214915E) begins at 
nucleotide 58 and ends at nucleotide 1156. 

Primers are designed based on known sequence; one 
primer is synthesized to initiate extension in the antisense 
direction (XLR) and the other to extend sequence in the 
sense direction (XLF). The primers allow the sequence to be 
extended "outward" generating amplicons containing new, 
unknown nucleotide sequence for the gene of interest. The 
primers may be designed using Oligo 4.0 (National Bio- 
sciences Inc, Plymouth, Minn.), or another appropriate 
program, to be 22-30 nucleotides in length, to have a GC 
content of 50% or more, and to anneal to the target sequence 
at temperatures about 68°-72° C. Any stretch of nucleotides 
which would result in hairpin structures and primer-primer 
dimerizations was avoided. 
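
Two of the design criteria above, length and GC content, are easy to check mechanically. The sketch below is a minimal illustration with hypothetical function names; it does not estimate annealing temperature or screen for hairpins and primer-primer dimers, which a program such as Oligo 4.0 would handle.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in a primer (spaces are ignored)."""
    seq = primer.replace(" ", "").upper()
    return sum(base in "GC" for base in seq) / len(seq)


def meets_basic_criteria(primer, min_len=22, max_len=30, min_gc=0.50):
    """Check the length (22-30 nt) and GC content (50% or more) criteria."""
    seq = primer.replace(" ", "").upper()
    return min_len <= len(seq) <= max_len and gc_fraction(seq) >= min_gc


# The XLR and XLF primers given in the text for the 214915 sequence.
for name, primer in [
    ("XLR", "AAG ACA TCC AGG AGC CCA ATG AC"),
    ("XLF", "AGG TGA TCC TCA GCT GGA TGC AC"),
]:
    print(name, len(primer.replace(" ", "")),
          round(gc_fraction(primer), 3), meets_basic_criteria(primer))
```

Both primers are 23 nucleotides long with a GC content just above 50%, so they satisfy the two criteria checked here.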

The stomach cDNA library was used as a template, and 
XLR-AAG ACA TCC AGG AGC CCA ATG AC and 
XLF-AGG TGA TCC TCA GCT GGA TGC AC primers 
were used to extend and amplify the 214915 sequence. By 
following the instructions for the XL-PCR kit and thor- 
oughly mixing the enzyme and reaction mix, high fidelity 
amplification is obtained. Beginning with 25 pMol of each 
primer and the recommended concentrations of all other 
components of the kit, PCR is performed using the Peltier 
Thermal Cycler (PTC200; MJ Research, Watertown, Mass.) 
and the following parameters: 

Step 1 94° C. for 60 sec (initial denaturation) 

Step 2 94° C. for 15 sec 

Step 3 65° C. for 1 min 

Step 4 68° C. for 7 min 

Step 5 Repeat step 2-4 for 15 additional cycles 

Step 6 94° C. for 15 sec 

Step 7 65° C. for 1 min 

Step 8 68° C. for 7 min+15 sec/cycle 

Step 9 Repeat step 6-8 for 11 additional cycles 

Step 10 72° C. for 8 min

Step 11 4° C. (and holding) 
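
The cycle count implied by Steps 1-10 can be tallied as a check. The sketch below is an illustration under stated assumptions: "N additional cycles" is read as N + 1 passes through the block, the 15 sec extension increment in Step 8 is assumed to apply from the first cycle of that block, and ramp times and the final 4° C. hold are ignored. The function name is hypothetical.

```python
def block_seconds(denature, anneal, extend, cycles, increment=0):
    """Sum the programmed hold times (seconds) for one cycling block.

    Assumption: cycle n of the block (n = 1, 2, ...) extends for
    extend + increment * n seconds.
    """
    return sum(denature + anneal + extend + increment * n
               for n in range(1, cycles + 1))


first_block_cycles = 1 + 15    # Steps 2-4, then 15 additional cycles
second_block_cycles = 1 + 11   # Steps 6-8, then 11 additional cycles
total_cycles = first_block_cycles + second_block_cycles

hold_seconds = (
    60                                                    # Step 1
    + block_seconds(15, 60, 7 * 60, first_block_cycles)   # Steps 2-5
    + block_seconds(15, 60, 7 * 60, second_block_cycles,
                    increment=15)                         # Steps 6-9
    + 8 * 60                                              # Step 10
)

print(total_cycles, hold_seconds / 3600)
```

The two blocks give 16 + 12 = 28 cycles. Under the assumptions above the programmed holds come to a little over four hours, which is shorter than the six to eight hours quoted for the amplification step; ramp times and handling are not counted here.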

At the end of 28 cycles, 50 µl of the reaction mix was
removed; and the remaining reaction mix was run for an 
additional 10 cycles as outlined below: 

Step 1 94° C. for 15 sec 

Step 2 65° C. for 1 min

Step 3 68° C. for (10 min+15 sec)/cycle

Step 4 Repeat step 1-3 for 9 additional cycles

Step 5 72° C. for 10 min

A 5-10 µl aliquot of the reaction mixture is analyzed by
electrophoresis on a low concentration (about 0.6-0.8%)
agarose mini-gel to determine which reactions were successful
in extending the sequence. Although all extensions
potentially contain a full length gene, some of the largest 
products or bands are selected and cut out of the gel. Further 
purification involves using a commercial gel extraction 
method such as QIAQuick™ (QIAGEN Inc). After recovery 
of the DNA, Klenow enzyme is used to trim single-stranded, 
nucleotide overhangs creating blunt ends which facilitate 
religation and cloning. 

After ethanol precipitation, the products are redissolved in
13 µl of ligation buffer. Then, 1 µl T4-DNA ligase (15 units)
and 1 µl T4 polynucleotide kinase are added, and the mixture
is incubated at room temperature for 2-3 hours or overnight
at 16° C. Competent E. coli cells (in 40 µl of appropriate
media) are transformed with 3 µl of ligation mixture and
cultured in 80 µl of SOC medium (Sambrook J. et al, supra).
After incubation for one hour at 37° C., the whole transformation
mixture is plated on Luria Bertani (LB)-agar
(Sambrook J. et al, supra) containing 2xCarb. The following
day, 12 colonies are randomly picked from each plate and
cultured in 150 µl of liquid LB/2xCarb medium placed in an
individual well of an appropriate, commercially-available,
sterile 96-well microtiter plate. The following day, 5 µl of
each overnight culture is transferred into a non-sterile
96-well plate and after dilution 1:10 with water, 5 µl of each
sample is transferred into a PCR array.

For PCR amplification, 15 µl of concentrated PCR reaction
mix (1.33x) containing 0.75 units of Taq polymerase, a
vector primer and one or both of the gene specific primers 
used for the extension reaction are added to each well. 
Amplification is performed using the following conditions: 

Step 1 94° C. for 60 sec 

Step 2 94° C. for 20 sec 

Step 3 55° C. for 30 sec 

Step 4 72° C. for 90 sec 

Step 5 Repeat steps 2-4 for an additional 29 cycles 
Step 6 72° C. for 180 sec 
Step 7 4° C. (and holding) 
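
The 1.33x strength of the reaction mix follows from the volumes involved, as the short calculation below suggests. It is a sketch that assumes the 5 µl of diluted culture already in each well is the only other component of the reaction; the function name is hypothetical.

```python
def final_strength(mix_volume_ul, mix_strength, sample_volume_ul):
    """Strength of the reaction mix after the sample volume is included."""
    return mix_volume_ul * mix_strength / (mix_volume_ul + sample_volume_ul)


mix_ul, sample_ul = 15, 5
strength = final_strength(mix_ul, 1.33, sample_ul)
taq_units_per_ul = 0.75 / (mix_ul + sample_ul)
screen_cycles = 1 + 29  # Steps 2-4 once, then 29 additional cycles

print(strength, taq_units_per_ul, screen_cycles)
```

Adding 15 µl of 1.33x mix to 5 µl of sample gives a 20 µl reaction at very nearly 1x, and the screening program runs for 30 cycles in total.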

Aliquots of the PCR reactions are run on agarose gels 
together with molecular weight markers. The sizes of the 
PCR products are compared to the original partial cDNAs, 
and appropriate clones are selected, ligated into plasmid and 
sequenced. 

V Diagnostic Assays Using Kinase Specific Oligomers 

In those cases where a specific disorder or disease (see 
definitions supra) is suspected to involve altered quantities 
of a particular kinase, oligomers may be designed to estab- 
lish the presence and/or quantity of mRNA expressed in a 
biological sample. There are several methods currently 
being used to quantitate the expression of a particular 
molecule. Most of these methods use radiolabeled (Melby 
P. C. et al 1993 J Immunol Methods 159:235-44) or bioti- 
nylated (Duplaa C. et al 1993 Anal Biochem 229-36) 
nucleotides, coamplification of a control nucleic acid, and 
standard curves onto which the experimental results are 
interpolated. For example, phosphorylase B kinase defi- 
ciency may manifest as hepatomegaly which is inherited as 
either an X-linked or autosomal recessive trait or myoglo- 
binuria whose inheritance is unknown. 
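
The interpolation of experimental results onto a standard curve, mentioned above, can be sketched as a piecewise-linear lookup. The numbers below are placeholder values invented purely for illustration (they are not measurements from this application), and the function name is hypothetical; real assays often fit the curve on a logarithmic scale instead.

```python
def interpolate_quantity(signal, standards):
    """Estimate a quantity from a measured signal by linear interpolation.

    standards: (signal, quantity) pairs from control reactions of known
    quantity. Assumes the signals in `standards` are all distinct.
    """
    points = sorted(standards)
    for (s0, q0), (s1, q1) in zip(points, points[1:]):
        if s0 <= signal <= s1:
            return q0 + (q1 - q0) * (signal - s0) / (s1 - s0)
    raise ValueError("signal lies outside the standard curve")


# Hypothetical standard curve: (signal, quantity) in arbitrary units.
example_standards = [(100, 1.0), (200, 2.0), (400, 4.0)]
print(interpolate_quantity(300, example_standards))
```

A signal outside the range covered by the standards is rejected rather than extrapolated, since the curve says nothing reliable beyond its end points.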

Oligomers for phosphorylase B kinase are first used in
quantitative PCR to establish a normal range for expression
of phosphorylase B kinase. Then, these same oligomers are
used with extracts of cells from patients with inherited
phosphorylase B kinase deficiency. The information from
such studies is used to define different inheritance patterns
and to diagnose future patients displaying phosphorylase B
kinase deficiency-like symptoms. In like manner, this same 
assay can be used to monitor progress of the patient as 
his/her physiological situation moves toward the normal 
range during therapy for the condition. 

VI Kinases Kit 

The kinases of the subject invention are used to produce
a kinases kit for diagnosing disorders or diseases associated
with altered kinase expression. This involves designing
a plurality of oligomers, one set of which is specific for each
kinase or kinase regulatory sequence. Specificity in this case
refers to sequence similarity, to the length of the nucleic acid
molecule amplified, to cell or tissue type being screened or
to the disorder or disease. These oligomers are combined
with a biological sample obtained from a patient in a
solution sufficient for PCR and amplified. The PCR products
are examined first, to detect the expression of each kinase,
and second, to quantify the expression of each kinase. Kinase
expression is compared with standard ranges for normal and
abnormal expression. In the case(s) where kinase expression
is altered, use of the kit has provided the physician with a
named disorder or disease which can be treated or further
investigated.

A further use of the oligomers from the kinases kit is in
the diagnostic assay of Example V (above), used to monitor
patient response to drug therapy. Once the disease has been
named and a therapy chosen, the oligomers specific to the
patient's disease may be used periodically to monitor the
efficacy of the chosen therapy. In this case, the specific
oligomers are combined with a biological sample from the
patient in a solution sufficient for PCR and amplified. The
PCR product is quantified and compared with a normal
standard and with the pretreatment profile of the patient. If
the kinase expression is tending toward normal, the therapy
may be considered effective; if the expression is even more
abnormal, therapy should be discontinued and an alternative
treatment instituted.
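
The comparison described above, of a current measurement against both the normal standard and the patient's pretreatment profile, can be sketched as a simple rule. This is an illustration only: expression is reduced to a single number, the normal standard to a range, and the function names and example values are hypothetical.

```python
def distance_from_normal(value, normal_low, normal_high):
    """How far a value lies outside the normal range (0 if inside it)."""
    if value < normal_low:
        return normal_low - value
    if value > normal_high:
        return value - normal_high
    return 0.0


def assess_therapy(pretreatment, current, normal_low, normal_high):
    """Compare current expression with the pretreatment profile."""
    before = distance_from_normal(pretreatment, normal_low, normal_high)
    now = distance_from_normal(current, normal_low, normal_high)
    if now < before:
        return "tending toward normal"
    if now > before:
        return "more abnormal"
    return "unchanged"


# Hypothetical values in arbitrary units; normal range taken as 1.0-4.0.
print(assess_therapy(pretreatment=10.0, current=6.0,
                     normal_low=1.0, normal_high=4.0))
```

The rule works for both over- and under-expression, because it measures distance from the normal range rather than the direction of change.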

VII Sense or Antisense Molecules 

Knowledge of the correct cDNA sequence of any particular
kinase, its regulatory elements or parts thereof will
enable its use as a tool in sense (Youssoufian H. and H. F.
Lodish (1993) Mol Cell Biol 13:98-104) or antisense
(Eguchi et al (1991) Annu Rev Biochem 60:631-652) technologies
for the investigation of gene function.
Oligonucleotides, from genomic or cDNAs, comprising
either the sense or the antisense strand of the cDNA
sequence can be used in vitro or in vivo to inhibit expression.
Such technology is now well known in the art, and oligonucleotides
or other fragments can be designed from various
locations along the sequences.

The gene of interest can be turned off in the short term by
transfecting a cell or tissue with expression vectors which
will flood the cell with sense or antisense sequences until all
copies of the vector are disabled by endogenous nucleases.
Stable transfection of appropriate germ line cells or preferably
a zygote with a vector containing the fragment will
produce a transgenic organism (U.S. Pat. No. 4,736,866, 12
Apr. 1988), which produces enough copies of the sense or
antisense sequence to significantly compromise or entirely
eliminate normal activity of the particular kinase gene.
Frequently, the function of the gene can be ascertained by
observing behaviors such as lethality, loss of a physiological
pathway, changes in morphology, etc. at the intracellular,
cellular, tissue or organismal level.

In addition to using fragments constructed to interrupt
transcription of the open reading frame, modifications of
gene expression can be obtained by designing antisense
sequences to promoters, enhancers, introns, or even to
trans-acting regulatory genes. Similarly, inhibition can be
achieved using Hoogsteen base-pairing methodology, also
known as "triple helix" base pairing.

VIII Expression of Kinases

Expression of the kinases may be accomplished by subcloning
the cDNAs into appropriate vectors and transfecting
the vectors into host cells. In some cases, the cloning vector
previously used for the generation of the tissue library also
provides for direct expression of kinase sequences in E. coli.
Upstream of the cloning site, this vector contains a promoter
for β-galactosidase, followed by sequence containing the
amino-terminal Met and the subsequent 7 residues of
β-galactosidase. Immediately following these eight residues
is a bacteriophage promoter useful for transcription and a
linker containing a number of unique restriction sites.

[...] of interest. They include MMTV, SV40, and metallothionine
promoters for CHO cells; trp, lac, tac and T7 promoters for
bacterial hosts; and alpha factor, alcohol oxidase and PGH
promoters for yeast. In addition, transcription enhancers,
such as the Rous sarcoma virus (RSV) enhancer, may be used
in mammalian host cells. Once homogeneous cultures of
recombinant cells are obtained through standard culture
methods, large quantities of recombinantly produced peptide
can be recovered from the conditioned medium and analyzed
using methods known in the art.

IX Isolation of Recombinant KIN

KIN may be expressed as a recombinant protein with one
or more additional polypeptide domains added to facilitate
protein purification. Such purification facilitating domains
include, but are not limited to, metal chelating peptides such
as histidine-tryptophan modules that allow purification on
immobilized metals, protein A domains that allow purification
on immobilized immunoglobulin, and the domain utilized
in the FLAGS extension/affinity purification system

Induction of an isolated, transfected bacterial strain with 20 (Immunex Corp, Seattle, Wash.). The inclusion of a cleav- 

IPTG using standard methods will produce a fusion protein able linker sequence such as Factor XA or enterokinase 

corresponding to the first seven residues of B-galactosidase, (Invilrogen) between the purification domain and the km 

about 5 to 15 residues which correspond to linker, and the sequence may be useful to facilitate expression of KIN. 

peptide encoded within the kinase cDNA. Since cDNA X Testing for Kinase Activity 

clone inserts are generated by an essentially random process, 25 The sequences in this application represent many different 

there is one chance in three that the included cDNA will lie domains of different kinase families. These domains (and 

in the correct frame for proper translation. If the cDNA is not subdomains as detailed in the background of the invention) 

in the proper reading frame, it can be obtained by deletion may be utilized: 1) individually for the production of 

or insertion of the appropriate number of bases by well antibodies, 2) in functional groups (eg. to span a membrane), 

known methods including in vitro mutagenesis, digestion 30 and 3) as interchangable, usable parts of a chimeric kinase, 

with exonuclease III or mung bean nuclease, or oligonucle- The various partial cDNA sequences of this application 

otide linker inclusion represent the different kinase domains of the various fami- 

The kinase cDNA can be shuttled into other vectors. lies (Hardie G. and Hanks S., supra), and they may be 

known to be useful for expression of protein in specific recombined in numerous ways to produce chimenc nucleic 

hosts. Oligonucleotide linkers containing cloning sites as 35 acid molecules. For example, a known full length kinase 

well as a stretch of DNA sufficient to hybridize to the end of such as the human map kinase of this application (Seq ID No 

the target cDNA (25 bases) can be synthesized chemically 45) may be used to swap related portions of the nucleic acid 

by standard methods. These primers can then used to sequence, analogous to domains or subdomains of MAP 

amplify the desired gene fragments by PCR. The resulting kinase polypeptides. The chimeric nucleotides, so produced, 

fragments can be digested with appropriate restriction 40 may be introduced into prokaryotic host cells (as reviewed 

enzymes under standard conditions and isolated by gel inStrosbergA. D. and MarulloS. (1992) Trends PharmaSci 

electrophoresis. Alternatively, similar gene fragments can be 13:95-98) or eukaryotic host cells. These host cells are then 

produced by digestion of the cDNA with appropriate restric- employed in procedures to determine what molecules acti- 

tion enzymes and filling in the missing gene sequence with vate the kinase or what molecules are activated by a kinase 

chemically synthesized oligonucleotides. Partial nucleotide 45 Such activating or activated molecules may be of 

sequence from more than one gene can be ligated together extracellular, intracellular, biologic or chemical origin, 

and cloned in appropriate vectors to optimize expression. An example of a test system, in this case for protein 

Suitable expression hosts for such chimeric molecules tyrosine kinases, can be based on the interaction of protein 

include but are not limited to mammalian cells such as tyrosine kinases with chemokine receptors (Taniguchi X 

Chinese Hamster Ovary (CHO) and human 293 cells, insect 50 (1995) Science 268:251-255). These receptors are capable 

cells such as Sf9 cells, yeast cells such as Saccharomyces of activating a variety of nonreceptor protein tyrosine 

cerevisiae, and bacteria such as E. coli. For each of these cell kinases when stimulated by an extracellular chemokine. 

systems, a useful expression vector may also include an C-X-C chemokines such as platelet factor 4, mterleukin-8, 

origin of replication to allow propagation in bacteria and a connective tissue activating protein III. neutrophil activating 

selectable marker such as the B-Iactamase antibiotic resis- 55 peptide 2, are soluble activators of neutrophils, 

tance gene to allow selection in bacteria. In addition, the A standard measure of neutrophil activation involves 

vectors may include a second selectable marker such as the measuring the mobilization of Ca as part of the signal 

neomycin phosphotransferase gene to allow selection in transduction pathway. The experiment involves several 

transfected eukaryotic host cells. Vectors for use in eukary- steps. First, blood cells obtained from venipuncture are 

otic expression hosts may require RNA processing elements 60 fractionated by centrifugation on density gradients. Ennched 
such as 3' polyadenylation sequences if such are not part of populations of neutrophils are further fractionated on col- 

the cDNA of interest xitaos bv negative selection using antibodies specific for 

Additionally, some of the kinase vectors may contain other blood cells types. Next, neutrophils are transformed 
native promoters which will allow induction of gene expres- with an expression vector containing the kinase nucleic acid 

sion in human cells such as the 293 line mentioned above. 65 sequence of interest and preloaded fluorescent probe whose 
Other available promoters are host specific and may be emission characteristics have been altered by Ca binding, 
specifically combined with the coding region of the kinase Or in the alternative, the neutrophil is preloaded with the 
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purified kinase of interest and fluorescent probe. Then, when 
the cells are exposed to an appropriate chemokine, the 
chemokine receptor activates the kinase which, in turn, 
initiates Ca** flux. Ca ++ mobilization is observed and mea- 
sured using fluorometry as has been described in Grynk- 
ievicz G. et al (1985) J Biol Chem 260:3440, and McColl S. 
et al (1993) J Immunol 150:4550-4555, incorporated herein 
by reference. 
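
The fluorometric Ca++ measurement cited above (Grynkiewicz et al, 1985) is commonly reduced to a free Ca++ concentration using the ratio equation from that reference, [Ca++] = Kd x beta x (R - Rmin) / (Rmax - R). The Python sketch below assumes a dual-wavelength ratiometric probe. The specification does not name the probe, so the function name, the parameter names, and the default dissociation constant (224 nM, a value often quoted for fura-2) are illustrative assumptions and not part of the described method.

```python
# Minimal sketch (assumptions labeled): convert a fluorescence ratio to a
# free Ca++ concentration with the ratio equation
#     [Ca] = Kd * beta * (R - Rmin) / (Rmax - R)
# where beta is the ratio of the Ca-free to Ca-saturated signal at the
# second (denominator) wavelength.


def calcium_from_ratio(r: float, r_min: float, r_max: float,
                       beta: float, kd_nm: float = 224.0) -> float:
    """Return free Ca++ in nM for a measured fluorescence ratio r.

    r_min and r_max are the ratios at zero and saturating Ca++.
    kd_nm defaults to 224 nM, an assumed value; calibrate for the
    probe and conditions actually used.
    """
    if not (r_min < r < r_max):
        raise ValueError("ratio must lie strictly between r_min and r_max")
    return kd_nm * beta * (r - r_min) / (r_max - r)


if __name__ == "__main__":
    # Hypothetical calibration values chosen only to exercise the function.
    print(calcium_from_ratio(r=1.0, r_min=0.5, r_max=1.5, beta=1.0, kd_nm=100.0))
```

Comparing the value before and after chemokine exposure gives the size of the Ca++ flux attributed to the kinase under test.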

XI Identification of or Production of Kinase Specific Anti- 
bodies 

Purified KIN is used to screen a pre-existing antibody 
library or to raise antibodies, using either polyclonal or 
monoclonal methodology. For polyclonal antibody 
production, denatured peptide from the reverse phase HPLC 
separation is obtained in quantities up to 75 mg. This 
denatured protein can be used to immunize mice or rabbits 
using standard protocols; about 100 micrograms are 
adequate for immunization of a mouse, while up to 1 mg 
might be used to immunize a rabbit. In identifying mouse 
hybridomas, the denatured protein can be labelled and used 
to screen potential murine B-cell hybridomas for those 
which produce antibody. This procedure requires only small 
quantities of protein, such that 20 mg would be sufficient for 
labelling and screening of several thousand clones. 

For monoclonal antibody production, the amino acid 
sequence, as deduced from translation of the cDNA, is 
analyzed to determine regions of high immunogenicity. 
Peptides comprising appropriate hydrophilic regions are 
expressed from recombinant cDNA or synthesized and used
in suitable immunization protocols to raise antibodies. 
Selection of appropriate epitopes is described by Ausubel F. 
M. et al (supra). The optimal amino acid sequences for 
immunization are usually located at the C-terminus or
N-terminus and in intervening, hydrophilic regions of the 
polypeptide which are likely to be exposed to the external 
environment when the protein is in its natural conformation. 
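
One simple way to look for the hydrophilic regions mentioned above is a sliding-window hydropathy scan. The Python sketch below uses the Kyte-Doolittle hydropathy scale, which is a substitution made for illustration; the specification defers to Ausubel et al for epitope selection and does not name a scale. The function name and the 15-residue default window (borrowed from the oligopeptide length used elsewhere in this section) are assumptions.

```python
# Minimal sketch (illustrative only): find the most hydrophilic window in a
# deduced amino acid sequence using the Kyte-Doolittle hydropathy scale.
# Lower (more negative) average scores indicate more hydrophilic stretches.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def most_hydrophilic_window(protein: str, width: int = 15):
    """Return (start, peptide, mean_score) for the most hydrophilic window.

    Raises KeyError on residues outside the 20 standard amino acids and
    ValueError if the sequence is shorter than the window.
    """
    seq = protein.upper()
    if len(seq) < width:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(KYTE_DOOLITTLE[aa] for aa in window) / width
        if best_score is None or score < best_score:
            best_start, best_score = start, score
    return best_start, seq[best_start:best_start + width], best_score


if __name__ == "__main__":
    # Hypothetical toy sequence: a lysine-rich stretch flanked by leucines.
    print(most_hydrophilic_window("LLLLLKKKKKLLLLL", width=5))
```

A hydropathy scan is only a first filter; surface exposure in the native conformation, as the text notes, still has to be judged separately.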

Typically, selected oligopeptides, about 15 residues in 
length, are synthesized using an Applied Biosystems Peptide 
Synthesizer Model 431A using fmoc-chemistry and coupled
to keyhole limpet hemocyanin (KLH, Sigma) by reaction 
with M-maleimidobenzoyl-N-hydroxysuccinimide ester 
(MBS; Ausubel F. M. et al, supra). If necessary, a cysteine 
may be introduced at the N-terminus of the peptide to permit 
coupling to KLH. Rabbits are immunized with the peptide- 
KLH complex in complete Freund's adjuvant. The resulting 
antisera are tested for antipeptide activity by binding the 
peptide to plastic, blocking with 1% bovine serum albumin, 
reacting with antisera, washing and reacting with labelled, 
affinity purified, specific goat anti-rabbit IgG. 

Hybridomas may also be prepared and screened using 
standard techniques. Hybridomas of interest are detected by 
screening with labelled KIN to identify those fusions pro- 
ducing the monoclonal antibody with the desired specificity. 
In a typical protocol, wells of plates (FAST; Becton- 
Dickinson, Palo Alto, Calif.) are coated during incubation 
with affinity purified, specific rabbit anti-mouse (or suitable 
anti-species Ig) antibodies at 10 mg/ml. The coated wells are 
blocked with 1% BSA, washed and incubated with super- 
natants from hybridomas. After washing the wells are incu- 
bated with labelled KIN at 1 mg/ml. Supernatants with 
specific antibodies bind more labelled KIN than is detectable 
in the background. Then clones producing specific antibod- 
ies are expanded and subjected to two cycles of cloning at 
limiting dilution. Cloned hybridomas are injected into 
pristane-treated mice to produce ascites, and monoclonal 
antibody is purified from mouse ascitic fluid by affinity 
chromatography on Protein A. Monoclonal antibodies with
affinities of at least 10^8/M, preferably 10^9 to 10^10 or stronger,
will typically be made by standard procedures as described
in Harlow and Lane (1988) Antibodies: A Laboratory
Manual, Cold Spring Harbor Laboratory, Cold Spring
Harbor, N.Y.; and in Goding (1986) Monoclonal Antibodies:
Principles and Practice, Academic Press, New York City,
both incorporated herein by reference.

XII Diagnostic Assays Using KIN Specific Antibodies

Particular KIN antibodies are useful for investigation of
various disorders or diseases which may be characterized by
differences in the amount or distribution of KIN. Given the
usual role of the kinases, KIN might be expected to be
upregulated (or downregulated) in its involvement in activation
of signal cascades.

Diagnostic assays for KIN include methods utilizing the
antibody and a reporter molecule to detect KIN in human
body fluids, membranes, cells, tissues or extracts thereof.
The antibodies of the present invention may be used with or
without modification. Frequently, the antibodies will be
labelled by joining them, either covalently or noncovalently,
with a substance which provides for a detectable signal. A
wide variety of reporter molecules and conjugation techniques
are known and have been reported extensively in
both the scientific and patent literature. Suitable reporter
molecules or labels include those radionuclides, enzymes,
fluorescent, chemi-luminescent, or chromogenic agents previously
mentioned as well as substrates, cofactors,
inhibitors, magnetic particles and the like. Patents teaching
the use of such labels include U.S. Pat. Nos. 3,817,837;
3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and
4,366,241. Also, recombinant immunoglobulins may be
produced as shown in U.S. Pat. No. 4,816,567, incorporated
herein by reference.

A variety of protocols for measuring soluble or
membrane-bound KIN, using either polyclonal or monoclonal
antibodies specific for the protein, are known in the
art. Examples include enzyme-linked immunosorbent assay
(ELISA), radioimmunoassay (RIA) and fluorescent activated
cell sorting (FACS). A two-site monoclonal-based
immunoassay utilizing monoclonal antibodies reactive to
two non-interfering epitopes on KIN is preferred, but a
competitive binding assay may be employed. These assays
are described, among other places, in Maddox, D. E. et al
(1983, J Exp Med 158:1211).

XIII Purification of Native KIN Using Antibodies

Native or recombinant protein kinases can be purified by
immunoaffinity chromatography using antibodies specific
for that particular KIN. In general, an immunoaffinity column
is constructed by covalently coupling the anti-KIN
antibody to an activated chromatographic resin.

Polyclonal immunoglobulins are prepared from immune
sera either by precipitation with ammonium sulfate or by
purification on immobilized Protein A (Pharmacia Biotech).
Likewise, monoclonal antibodies are prepared from mouse
ascites fluid by ammonium sulfate precipitation or chromatography
on immobilized Protein A. Partially purified immunoglobulin
is covalently attached to a chromatographic resin
such as CnBr-activated Sepharose (Pharmacia Biotech). The
antibody is coupled to the resin, the resin is blocked, and the
derivative resin is washed according to the manufacturer's
instructions.

Such immunoaffinity columns may be utilized in the
purification of KIN by preparing a fraction from cells
containing KIN in a soluble form. This preparation may be
derived by solubilization of whole cells or of a subcellular
fraction obtained via differential centrifugation (with or
without addition of detergent) or by other methods well
known in the art. Alternatively, soluble KIN containing a
signal sequence may be secreted in useful quantity into the
medium in which the cells are grown.

A soluble KIN-containing preparation is passed over the
immunoaffinity column, and the column is washed under
conditions that allow the preferential absorbance of KIN (eg,
high ionic strength buffers in the presence of detergent).
Then, the column is eluted under conditions that disrupt
antibody/KIN binding (eg, a buffer of pH 2-3 or a high
concentration of a chaotrope such as urea or thiocyanate
ion), and KIN is collected.

XIV Drug Screening 

This invention is particularly useful for screening thera- 
peutic compounds by using binding fragments of KIN in any 
of a variety of drug screening techniques. The molecules to 
be screened may be of extracellular, intracellular, biologic or
chemical origin. The peptide fragment employed in such a 
test may either be free in solution, affixed to a solid support, 
borne on a cell surface or located intracellularly. One may 
measure, for example, the formation of complexes between 
KIN and the agent being tested. Alternatively, one can
examine the diminution in complex formation between KIN 
and a receptor caused by the agent being tested. 

Methods of screening for drugs or any other agents which 
can affect signal transduction comprise contacting such an 
agent with KIN fragment and assaying for the presence of a 
complex between the agent and the KIN fragment. In such
assays, the KIN fragment is typically labelled. After suitable 
incubation, free KIN fragment is separated from that present 
in bound form, and the amount of free or uncomplexed label 
is a measure of the ability of the particular agent to bind to 
KIN.

Another technique for drug screening provides high 
throughput screening for compounds having suitable bind- 
ing affinity to the KIN polypeptides and is described in detail 
in European Patent Application 84/03564, published on Sep. 
13, 1984, incorporated herein by reference. Briefly stated,
large numbers of different small peptide test compounds are 
synthesized on a solid substrate, such as plastic pins or some 
other surface. The peptide test compounds are reacted with 
KIN fragment and washed. Bound KIN fragment is then 
detected by methods well known in the art. Purified KIN can 
also be coated directly onto plates for use in the aforemen- 
tioned drug screening techniques. In addition, non- 
neutralizing antibodies can be used to capture the peptide 
and immobilize it on the solid support. 

This invention also contemplates the use of competitive 
drug screening assays in which neutralizing antibodies
capable of binding KIN specifically compete with a test 
compound for binding to KIN fragments. In this manner, the 
antibodies can be used to detect the presence of any peptide 
which shares one or more antigenic determinants with KIN. 

XV Identification of Molecules Which Interact with KIN

The inventive purified KIN is a research tool for
identification, characterization and purification of
interacting, signal transduction pathway proteins. Appropriate
labels are incorporated into KIN by various methods
known in the art and KIN is used to capture soluble or
interact with membrane-bound molecules. A preferred
method involves labeling the primary amino groups in KIN
with 125I Bolton-Hunter reagent (Bolton, A. E. and Hunter,
W. M. (1973) Biochem J 133:529). This reagent has been
used to label various molecules without concomitant loss of
biological activity (Hebert C. A. et al (1991) J Biol Chem
266:18989-94; McColl S. et al (1993) J Immunol
150:4550-4555). Membrane-bound molecules are incubated
with the labelled KIN molecules, washed to remove
unbound molecules, and the KIN complex is quantified.
Data obtained using different concentrations of KIN are used
to calculate values for the number, affinity, and association
of KIN with the signal transduction complex.
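
One conventional way to turn binding data collected at several concentrations of labelled KIN into a number of sites and an affinity is a Scatchard analysis. The Python sketch below assumes a single class of non-interacting binding sites, which the specification does not state; the function name and the synthetic data are hypothetical and serve only to show the calculation.

```python
# Minimal sketch (single-site assumption): Scatchard analysis of binding data.
# For one class of sites, bound/free = (Bmax - bound) / Kd, so a straight-line
# fit of bound/free against bound has slope -1/Kd and x-intercept Bmax.


def scatchard_fit(free, bound):
    """Return (kd, bmax) from paired free and bound ligand concentrations.

    Uses an ordinary least-squares line through (bound, bound/free).
    Units of kd and bmax follow the units of the inputs.
    """
    if len(free) != len(bound) or len(free) < 2:
        raise ValueError("need at least two paired observations")
    xs = list(bound)
    ys = [b / f for b, f in zip(bound, free)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    kd = -1.0 / slope
    bmax = -intercept / slope
    return kd, bmax


if __name__ == "__main__":
    # Synthetic data generated from Kd = 2 and Bmax = 10 (arbitrary units).
    free = [0.5, 1.0, 2.0, 4.0, 8.0]
    bound = [10.0 * f / (2.0 + f) for f in free]
    print(scatchard_fit(free, bound))
```

Real binding data are noisy and often better served by direct nonlinear fitting; the linear form is shown here because it makes the relationship between the measured quantities explicit.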



Labelled KIN fragments are also useful as a reagent for 
the purification of molecules with which KIN interacts, 
specifically including inhibitors. In one embodiment of 
affinity purification, KIN is covalently coupled to a chro- 
matography column. Cells and their membranes are 
extracted, KIN is removed and various KIN-free subcom- 
ponents are passed over the column. Molecules bind to the 
column by virtue of their KIN affinity. The KIN-complex is 
recovered from the column, dissociated and the recovered 
molecule is subjected to N-terminal protein sequencing. 
This amino acid sequence is then used to identify the 
captured molecule or to design degenerate oligomers for 
cloning its gene from an appropriate cDNA library. 

In an alternate method, monoclonal antibodies raised 
against KIN fragments are screened to identify those which 
inhibit the binding of labelled KIN. These monoclonal 
antibodies are then used in affinity purification or expression 
cloning of associated molecules. Other soluble binding 
molecules are identified in a similar manner. Labelled KIN 
is incubated with extracts or other appropriate materials 
derived from rheumatoid synovium. After incubation, KIN 
complexes (which are larger than the lone KIN fragment) are 
identified by a sizing technique such as size exclusion 
chromatography or density gradient centrifugation and are
purified by methods known in the art. The soluble binding 
protein(s) are subjected to N-terminal sequencing to obtain 
information sufficient for database identification, if the 
soluble protein is known, or for cloning, if the soluble 
protein is unknown. 

XVI Use and Administration of Antibodies or Other Inhibi- 
tory Molecules 

Antibodies, inhibitors, receptors or antagonists of KIN 
fragments (or other treatments to limit signal transduction, 
TST), can provide different effects when administered thera- 
peutically. TSTs will be formulated in a nontoxic, inert, 
pharmaceutically acceptable aqueous carrier medium preferably
at a pH of about 5 to 8, more preferably 6 to 8,
although the pH may vary according to the characteristics of 
the antibody, inhibitor, or antagonist being formulated and 
the condition to be treated. Characteristics of TSTs include 
solubility of the molecule, half-life and antigenicity/ 
immunogenicity; these and other characteristics may aid in 
defining an effective carrier. Native human proteins are 
preferred as TSTs, but organic or synthetic molecules result- 
ing from drug screens may be equally effective in particular 
situations. 

TSTs may be delivered by known routes of administration 
including but not limited to topical creams and gels; trans- 
mucosal spray and aerosol; transdermal patch and bandage; 
injectable, intravenous and lavage formulations; and orally 
administered liquids and pills particularly formulated to 
resist stomach acid and enzymes. The particular 
formulation, exact dosage, and route of administration will 
be determined by the attending physician and will vary 
according to each specific situation.

Such determinations are made by considering multiple 
variables such as the condition to be treated, the TST to be 
administered, and the pharmacokinetic profile of the par- 
ticular TST. Additional factors which may be taken into 
account include disease state (e.g. severity) of the patient, 
age, weight, gender, diet, time and frequency of 
administration, drug combination, reaction sensitivities, and 
tolerance/response to therapy. Long acting TST formula- 
tions might be administered every 3 to 4 days, every week, 
or once every two weeks depending on half-life and clear- 
ance rate of the particular TST. 

Normal dosage amounts may vary from 0.1 to 100,000 
micrograms, up to a total dose of about 1 g, depending upon 
the route of administration. Guidance as to particular dos- 
ages and methods of delivery is provided in the literature. 
See U.S. Pat. No. 4,657,760; 5,206,344; or 5,225,212. Those 
modifications and variations of the described method and
system of the invention will be apparent to those skilled in 
the art without departing from the scope and spirit of the 
invention. Although the invention has been described in 
connection with specific preferred embodiments, it should 
be understood that the invention as claimed should not be 
unduly limited to such specific embodiments. Indeed, vari- 
ous modifications of the above-described modes for carrying 
out the invention which are obvious to those skilled in the 
field of molecular biology or related fields are intended to be 
within the scope of the following claims. 



TABLE 1

Clone     Library               GenBank/SwissProt Identifier, Name

297       U937                  P00540 Mouse protooncogene ser/thr kinase
1622      U937                  HUMCLK3B clk3 gene product
10007     THP-1 Phorbol LPS     HSPLK1 protein kinase
12702     THP-1 Phorbol LPS     RATSGPK ser/thr kinase
23789     Inflamed Adenoid      CHKFRNK chicken tyr kinase
35652     HUVEC                 KEK5 Chicken Y kinase receptor
35855     HUVEC                 HUMANBTK37 tyr kinase
40194     T+B Lymphoblast       KRB1 VARV Variola virus protein kinase
42170     T+B Lymphoblast       ribUUyj04 serine kinase
46Uol     Corneal Stroma        iodsjiNi yeast protein kinase
46651     Corneal Stroma        CDK4 P11802
53840     Fibroblast            HSDAPK Death-associated protein kinase
54065     Fibroblast            SCPROKIN1 yeast 35.6 kD
56494     Fibroblast            KLMC RAT, myosin light chain kinase
          Skeletal Muscle       /vin^iKiA i a. smitana i Kinase receptor
64663     Placenta              KIN3 Yeast protein kinase P22209
67967     HUVEC Sheer Stress    YAK1 Yeast protein kinase
68963     HUVEC Sheer Stress    KATK Human Y kinase
71904     Placenta              KIN3 P22209SwP
75289     THP-1 Phorbol         H5U08023 Avian retrovirus rpl30
81865     Rheumatoid Synovium   SNFT Yeast C catabolite derepressing
82056     HUVEC Sheer Stress    P34314 C. elegans ser/thr kinase
108485    AML Blast             KAPA Pig cAMP-dependent protein kinase
114973    Testis                CC2B ARATH Mouse-ear cress cdc
118591    Skeletal Muscle       PB0192 mixed lineage kinase 1
119819    Skeletal Muscle       H5U09564 ser kinase
120376    Skeletal Muscle       U01064 Y kinase
132750    Bone Marrow           MLK2 mixed lineage kinase 2
140052    T Lymphocyte          G-protein coupled receptor kinase
146392    T Lymphocyte          SCYAK1 Yeast Yak1 kinase
156108    THP-1 Phorbol LPS     U01064 Dictyostelium Y kinase
173627    Bone Marrow           MMU14166 Kiz
181971    Placenta              HUMTKR Y kinase receptor
182538    Placenta              HSNEK2R kinase
184416*   Cardiac Muscle        KPKS Human proto-oncogene Ser/Thr kinase
191283    Rheumatoid Synovium   RATSGPK Ser/Thr kinase
192268    Rheumatoid Synovium   ATHAPK1A Ser/Thr kinase
214915    Stomach               XLMPK2K Map kinase
223163    Pancreas              TGF-β receptor ser/thr kinase
237002    Small Intestine       P16227 Mouse Y kinase blk
239990    Hippocampus           SHC Human transforming protein
240142    Hippocampus           HSNEK2R
275781    Testes                BOVCKIA casein kinase
285465    Eosinophils           DDIMLCK myosin light chain kinase



SEQUENCE LISTING

( 1 ) GENERAL INFORMATION:

( i i i ) NUMBER OF SEQUENCES: 45

( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 526 base pairs



skilled in the art will employ different formulations for 
different TSTs. Administration to cells such as nerve cells 
necessitates delivery in a manner different from that to other 
cells such as vascular endothelial cells. 

It is contemplated that disorders or diseases which trigger
defensive signal transduction may precipitate damage that is 
treatable with TSTs. These disorders or diseases may be 
specifically diagnosed by the tests discussed above, and such 
testing should be performed in cases where physiologic or 
pathologic problems are suspected to be associated with
abnormal signal transduction. 

All publications and patents mentioned in the above 
specification are herein incorporated by reference. Various 
( B ) TYPE: nucleic acid

( C ) STRANDEDNESS: single 

( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: U937 
( B ) CLONE: 297 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:1:

ACAAGGGTTG TAATTAAAGG CGATTTTGAA ACAATTAAAA TCTGTGATGT AGGAGTCTCT 60

CTACCACTGG ATGAAAATAT GACTGTGACT GACCCTGAGG CTTGTTACAT TGGCACAGAG 120

CCATGGAAAC CCAAAGAAGC TGTGGAGGAG AATGGTGTTA TTACTGCAAG GCAGACATAT 180

TTGCCTTTGG CTTACTTTGT GGGAAATGAT GACTTTATCG ATTCCACACA TTAATCTTTC 240

AAATGATGAT GATGATGAAG TAAAAACTTT TTGATGAAAA GTAATTTTGA TGTTGAAGCA 300

TTACTATGCA AGCCCTTTGG ACCTAAGGCC ACCCTATTTT AATATTGGAG GACCTTGOTG 360

AATCATACCC AGGAAGGTAA TTTGACCTCT TCTCTGATCA CCCTTATTGA AGCCCCCAAG 420

CACCCTTCTT GTGACAATTT TAGGTTGGAC CAGTTGCTTT GGGCCAACTT AACTAAAGTT 480 

GTTCGAAAAA CTTTTTTCCA AAAATTTCCA TAGGCCTCCC AAGTTT 526 
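
Because each entry declares a LENGTH and prints a running position at the end of every line, a listing can be checked mechanically. The Python sketch below counts the residues in the printed lines so the total can be compared with the declared length. The function name is hypothetical, and the parsing assumes the layout used here: blocks of letters separated by spaces, with an optional trailing position number.

```python
# Minimal sketch (illustrative only): count residues in printed sequence
# listing lines so the total can be compared with the declared LENGTH.
import re


def count_bases(listing_lines):
    """Return the number of sequence letters across the given lines.

    A trailing run of digits (the running position) is dropped from each
    line, then every remaining alphabetic character is counted.
    """
    total = 0
    for line in listing_lines:
        body = re.sub(r"\s+\d+\s*$", "", line)  # strip trailing position
        total += len(re.sub(r"[^A-Za-z]", "", body))
    return total


if __name__ == "__main__":
    # Final printed line of SEQ ID NO:1; the eight full lines before it
    # carry 60 residues each, so the listing totals 480 + 46 = 526.
    last_line = "GTTCGAAAAA CTTTTTTCCA AAAATTTCCA TAGGCCTCCC AAGTTT 526"
    print(8 * 60 + count_bases([last_line]))
```

A mismatch between the count and the declared length usually points to a dropped or duplicated block in the printed copy.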

( 2 ) INFORMATION FOR SEQ ID NO:2: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 378 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: U937
( B ) CLONE: 1622

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

AGAACACCAC ATCCGAGTGG CTGACTTTGG CAGTGCCACA TTTGACCATG AGCACCACAC 60 

CACCATTGTG GCCACCCGTC ACTATCGCCG CCTGAGGTOA TCCTTGAGCT GGGCTGGGCA 120 

CAGCCTGGTG ACGTCTGGGC ATTGGCTGCA TTCTCTTTGA GTACTACCGG GGCTTCACAC 180

TCTTCCAGAC CCACGAAAAC CGAGAGCACC TGGTGATGAT GGAGAAGATC CTAGGGCCCA 240

TCCCATCACA CATGATCCAC CGTACCAGGA AGCAGAATAT TTCTACAAAG GGGGCCTAGT 300

TTGGGATGGA CAGCTCTTAC GGCCGGTATG TAAGGGACTC AAACCTTTAA GGTTCATGTT 360

CAAGCTTCCT GGGAAGTG 378 

( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 326 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: THP-1 Phorbol LPS
( B ) CLONE: 10007

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:3:

GGGCTGGCAG CCCGGTTGGA GCCTCCGGAG CAGAGGAAGA AGACCATCTT GGCACCCCCA 60

ACTATGTCGC TCCAGAAGTG CTGCTGAGAC AGGGCCACGG CCCTGAGGCG GATGTATGGT 120

CACTGGGCTG TGTCATGTAC ACGCTGCTCT GCGGGACCCT CCCTTTGAGA CGGCTGACCT 180

GAAGGAGACG TACCGCTGCA TCAAGAAGGT TCACTACAAC GGTGCCTGCC AGCTCTTAAT 240

GCCTGCCCGA GTCCTTGGCC GCAATCCTTC GGGCCTTAAC CCGAGAACCG GCCCTCTATT 300

GACAGATCCT TGCGGCAATT AACTTT 326



( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 257 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: THP-1 Phorbol LPS
( B ) CLONE: 12702

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:4:

CCGCAAGACA CCTCCTGGAG GGCCTCCTGA GAAGGACAGG CAAAGGGCTG GGCCAAGGAT 60

GACTTCATGG AGATTAAGAG TCATGTTTCT TCTCCTTAAT TAACTGGGAT GATCTCATTA 120

ATAAGAAGAT TACTCCCCCT TTTACCCAAA TGTGAGTGGG CCCAACGCCT ACGGACTTTG 180

CCCCGAGTTT ACGAAGAGCC TTCCCCAATC CATTGGAAGT CCCCTGAAAG GTCCTATACA 240

AGTCAGTTAA GGAAGTT 257



( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 252 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Inflamed Adenoid
( B ) CLONE: 23789

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:5:

GTGAAGAATG TGGGGCTGAC CCTCGGAAGT CATCGGGAGC GTGGATGATC TCCTGCCTTC 60

CTTGCCGTCA TCTCACGGAC AGAGATCGAG GGCACCCAGA AACTGCTCAA CAAAGACCTG 120

GCACAGCTCA TCAACAAGAT GCGCTGGCGC AAGAACGCGT GACCTCCCTG TAGGAGTAAG 180

AGGCAGATCT GACGGTTCAC AACCCTGGCT GTGACGCAAG AACCTCTTAC GTGTGCCAGG 240

CCCAAAGTTC TG 252



( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 255 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Huvec
( B ) CLONE: 35652

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CAAAATCGTG GCCCGGAGAA TGGCGGGGCC TCAACCCTCT CCTGGACCAG CGGCAGCTCA 60

CTACTCAGCT TTTGGCCTGT GGGCGAGTGG CTTCGGGCCA TCAAAATGGG AAGATACGAA 120

GAAAGTTTCG CAGCCGCTCG CTTTGGCTCC TTCAGCTGGT CAGCCAGATC TCTGCTGAGG 180

ACCTGCTCCG AATCGAGTCA CTCTGGCGGG ACACCAGAAG AAAATTTGGC CAGTTCCAGC 240

ACATGAGTCC CAGGT 255



( 2 ) INFORMATION FOR SEQ ID NO:7: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 238 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Huvec 
( B ) CLONE: 35855

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GAATACCCCA TATACATAGT GACTGATATA TAAGCAATGG CTGCTTGCTG AATACCTGAG 60

GAGTCACGGA AAAGGCTTAA CCTTCCCAGT CTTAGAAATG TGCTACGATG TCTGTAAGGC 120 

ATGGCCTTCT TGGAGAGTCA CCAATTCATA CACCGGGCTT GGCTGCTCGT AACTGCTTGG 180 

TGGACAGAGA TCTCTGTGTG AAAGTTCTCC ATTTGGATGA CAAGGTATGT TCTTGATG 238

( 2 ) INFORMATION FOR SEQ ID NO:8:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 261 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T+B Lymphoblast 
( B ) CLONE: 40194 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:8: 



AAACAACTTG ATTATTTAGG AATTCCTCTG TTTTATGGAT CTGGTCTGAC TGAATTCAAG 60

GGAAGAAGTT ACAGATTTAT GGTAATGGAA AGACTAGGAA TAGATTTACA GAAGATCTCA 120

GGCCAGAATG GTACCTTTAA AAAGTCAACT GTCCTGCAAT TAGGATCCGA ATGTTGGATG 180

TACTGGAATA TATACATGAA AATGAATATG TTCATGGTGA TATAAAAGCA GCAAATCTAC 240

TTTTGGGTTA CAAAAATCCT T 261



( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 242 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: T+B Lymphoblast 
( B ) CLONE: 42170 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

TAAGAAACCT GAAGATCGAG CCACTCCTGA ACAATGTCTA AAGCACCCCT GGTTGACACA 60

GAGCAGTATT CAAGAGCCTT CTTTCAGGAT GGAAAAGGCA CTAGAAGAAG CAAATGCCCT 120

CCAAGAAGGT CATTCTGTGC CTGAAATTAA TTCGGATACC GACAAATCAG AAACCGAGGA 180

ATCCATTGTA ACCGAAGAGT TAATTGTAGT TACTTCATAT ACTCTAGGGC AATGCAGACA 240

GT 242

( 2 ) INFORMATION FOR SEQ ID NO: 10: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 222 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Corneal Stroma 
( B ) CLONE: 46081 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10:

GCAAAGGACA GTCCGCCGAG GTGCTCGGTG GAGTCATGGC ATTCCCTTTT GGAAGACTGG 60 

CCTTGGTGCA AACCCTGGAG AAGGTGCCTA TGGAGAAGTT CAACTTGCTG TAAATAGAGT 120 

AACTAAGAAG CAGTCGCAGT GAAGATTTAG ATATAAGCGT GCCGTAGACT GTCCCGAAAA 180

TATTAAGTAG ATCTGTATCA ATAAAATGCT AATCATGAAA TT 222 

( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 225 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Corneal Stroma 
( B ) CLONE: 46651 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11:

ATGCTCCGCC AGTGAGAAGG GCGGCTGCCT GAGCGCCTCA CCAGTCCTCA TCACCCAGAT 60 

CCTGTGGCTT TGAGACACCT TCACTTAAGA ACATTTGCCA CTTGACTTAA ACCAGAAACG 120

TGTTTTGTGG CATCAGCAGA CCCTTTCTCA GGTAAGTTGT GCTTTGCTTT TAGCATACGT 180 

GAGAAGTTGT TCCGCTCCAT TTTGTGGGAC GTCTTTCTTT CCTTG 225 

( 2 ) INFORMATION FOR SEQ ID NO: 12: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 256 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 53840 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:12: 
CAGCGCCTTA CATCTCGCAG CCAAGAACAG CCACCATGAA TGCATCAGGA AGCTGCTTCA 60 
TCTAAATGCC CAGCCGAAAG TTTTGACAGC TCTGGGAAAA CAGCTTTACA TTATGCAGCG 120
GCTCAGGGCT GCCTTCAAGC TGTGCACATT CTTGCGAACA CAAGAGCCCC ATAAACCTCA 180
AAGATTTGGA TGGGAATATA CCGCTGCTGC TTGCTGTACA AAATGGTCAC AGTGAGATCT 240
GTCACTTTTC CTGGTC 256

( 2 ) INFORMATION FOR SEQ ID NO:13: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 240 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 54065 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:13: 

GTTGACATCT GGTCCCTGGG CATATGGCCA TCGAAATGAT TGAAGGGGAG CCTCATACCT 60

CAATGAAAAC CCTTGAGAGC CTTGTACCTC ATTGCCACCA ATGGGACCCC AGAACTTCAG 120

AACCCAGAGA AGCTGTCAGC TATCTTCCGG GACTTTCTGA ACCGCTGTCT CGAGATGGAT 180

GTGGAGAAGA GAGGTTCAGC TAAAGAGCTG CTACAGCATC AATTCCTGAA GATTGCCAAT 240

( 2 ) INFORMATION FOR SEQ ID NO:14: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 195 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Fibroblast 
( B ) CLONE: 56494 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:14: 

AACAGTGAAG AGCTCCGAGA AATTATGGGT ACCCTGATAT GTGGCTCCTG AAATTTAGTT 60

ATGATCCTAT AAGCATGGCA ACAGATATTG GAGCATTGGA GTGTTAACAT ATGTCATGCT 120

TACAGCAATA TCACCTTTTT AGGCAATGAT AAACAAGAAA CATTCTTAAA CATCTCACAG 180

ATGATTTTAA GTTAT 195 

( 2 ) INFORMATION FOR SEQ ID NO:15:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 207 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle 
( B ) CLONE: 58029 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:15:

GGAGTGTTTA TCGAGCCAAA TGGATATCAC AGGACAAGGA GGTGGCTGTA AAGAAGCTCC 60 

TCAAAATAGA GAAAGAGGCA GAAATACTCA GTGTCCTCAG TCACAGAAAC ATCATCCAGT 120 

TTTATGGAGT AATTTTGAAC CTCCCAACTA TGGCATTGTC ACAGAATATG CTTCTTGGGT 180

CACTCTATGA TTACATTAAC AGTACAA 207

( 2 ) INFORMATION FOR SEQ ID NO:16: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 184 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta 
( B ) CLONE: 64663 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:16:

CGGGGTGGTA AAACTTGGAG ATCTTGGGAT TGGCGGTTTT AGCTCAAAAA CCACAGCTGC 60 

ACATTCTTTA GTTGGTACGC CTATTCATGT TCCAGAGGAT ACAGAAATGG ATACAACTTC 120

AAATCTCATC TGGTCTCTTG GCTGTCTACT ATATGGATGG CTGCATTACA AAGTCCTTTC 180

TATG 184

( 2 ) INFORMATION FOR SEQ ID NO: 17: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 206 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: HUVEC Sheer Stress 
( B ) CLONE: 67967 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:17: 

TGAATTGCTG AGCATAGACC TTTATGAGCT CATTAAAAAA AATAAGTTTC AGGTTTTAGC 60 

GTCCAGTTGG TACGCAAGTT TGCCCAGTCC ATCTTGCAAT CTTTGGTGCC CTCCACAAAA 120 

TAAGATTATT CACTCCGATC TGAGCCAGAA AACATTCTCC TGAAACACCA CGGGCGCAGT 180 

TCAACCAAGG TCATTGACTT TGGGTT 206 

( 2 ) INFORMATION FOR SEQ ID NO:18: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 268 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: HUVEC Sheer Stress
( B ) CLONE: 68963

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:18:

GGGAAGTGGC CAGTTTGGAG TGGTCAGCTG GGCAAGTGGA AGGGGCAGTA TGATGTTGCT 60

GTTAACATGA TCAAGGAGGG CTCCATGTCA CAAGATGAAT TTTTCAGGAG GCCCAGACTA 120

TATGAAACTC AGCCATCCCA AGCTGCTTAA ATTCTATGGA GTGTGTTAAA GGATTACCCC 180

ATATACATGT GACTAATATA TAGCAATGCT TGCTTTTCTG AATTACCTGG GGAGTCACGG 240 

AAAAAGGACT TTTAACCCTT CCCGCTTG 268

( 2 ) INFORMATION FOR SEQ ID NO:19:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 224 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta 
( B ) CLONE: 71904 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:19: 

CCTGGGGTGG TAAAACTTGG AGACTTGGCT TGGCCGGTTT TCCACCTCAA AAACCACAGC 60 

TGCACATCCT TTAGTTGGTA CGCCTTATTA CATGTTCCAG AGAGATACAT GAAAATGGAT 120 

ACAACTCAAA CTGACATCTG GCCTTTGGCT GTTACTATAT GAATGGCTGC TTACAAAGCC 180

TTCCTATGGT GACAAAATGA TTTTACTCAT TGTGTAAGAG ATAG 224 

( 2 ) INFORMATION FOR SEQ ID NO:20: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 195 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: THP-1 Phorbol 
( B ) CLONE: 75289 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:20:

GCGGGGAATG ACTCCCTATC CTGGGGTCCA GAACCATGAG ATGTATGATA TCTTCTCCAT 60

GGCCACAGGT TGAAGCAGCC CGAAGACTGC CTGGTGAACT GTATGAAATA ATGTACTCTT 120 

GCTGGAGAAC CGATCCCTTA GACCGCCCCA CCTTTTCATA TTGAGGCTGC AGCTAGAAAA 180 

ACTCTTAGAA AGTTT 195 

( 2 ) INFORMATION FOR SEQ ID NO:21:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 219 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Rheumatoid Synovium 
( B ) CLONE: 81865 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:21:

CACACGAGAA GCAGAAACAC GACGGGCGGG TAAGATCGGC CACTACATTC TGGTGACACG 60

CTGGGGGTCG GCACCTTCGG CAAAGTGAAG GTTGGCAAAC ATGATTGACT GGCATAAAGT 120

AGCTGTAAGA TACTCATCGA CAGAAGATTC GGAGCCTTGA TGTGGTAGGA AAAATCCCAG 180

GAAATTCAGA ACCTCAAGCT TTTCAGGCAT CCTCATATA 219 

( 2 ) INFORMATION FOR SEQ ID NO:22: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 181 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: HUVEC Sheer Stress 
( B ) CLONE: 82056 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:22:

CCACCAAAGA TCTCAAATAA AGTTGATCTG TGGTCGGTGG GTGTATCTCT ATCAGTGTCT 60 

TTATGGAAGG AAGCCTTTTG GCCATAACCA GTCTCAGCAA GACATCCTAC AAGAGAATAC 120 

GATTTTAAAG CTACTGAAGT GCAGTTCCCG CCAAAGCCAG TAGTAACACC TGAAGCAAAG 180 

G 181

( 2 ) INFORMATION FOR SEQ ID NO:23: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 218 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: AML Blast 
( B ) CLONE: 108485 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:23: 

TATGGTTATA TGGAAGAGAA TGTGACTGGT GGTCGGTTGG GGTATTTTTA f ACGAAATGC 60

TTGTAGGTGA TACACCTTTT TATGCAGATT CTTTGGTTGG AACTTACAGT AAAATTATGA 120 

ACCATAAAAA TTCACTTACC TTTCCTGATG ATAATGACAT ATCAAAAGAA GCAAAAAACC 180 

TTATTTGTCC CTTCCTTACT GACAGGGAAG TGACGTTA 218

( 2 ) INFORMATION FOR SEQ ID NO:24: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 264 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Testis 
( B ) CLONE: 114973 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24: 

GACGGTGGCC ATTTGACATG TGGAGCCTGG GTGCATCACG GTGGAGTTGT ACACGGGCTA 60

CCCCCTGTTC CCCGGGAGAA TGAGGTGGAG CAGCTGGCCT GCATCATGGA GGTGCTGGGT 120

CTGCCGCCAG CCGGCTTCAT TCAGACAGCC TCCAGGAGAC AGACATTCTT TGATTCCAAA 180 

GGTTTTCCTA AAAATATAAC CACAACCAGG GGAAAAAAAG ATTCCAGATT CCAAGGGCCC 240 

TCACGGATTG GTGCTGAAAA AACT 264 

( 2 ) INFORMATION FOR SEQ ID NO:25:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 236 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 118591

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:25:

GACTGAGGAC ACTGAAACAT CATCCAGTTT TATGGAGTAA TTCTTGAACC TCCCAACTAT 60

GGCATTGTCA CAGAATATGC TTCTCTGGGA TCACTCTATG ATTACATTAA CAGTAACAGA 120

AGTGAGGAGA TGCATATGGT CACATTATGA CCTGGGCCAC TGATGTAGCC AAAGGAATGC 180 

ATTATTTACA TATGGGGCTC CTGTCAAGGT GATTCACAGA GACCTCAAGT CAAGGA 236 

( 2 ) INFORMATION FOR SEQ ID NO:26: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 200 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 119819

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:26: 

CCTGCATGGC CTTCGAGCTG GCCACTGGTG ACTACCTGTT CGAGCCGCAT TCTGGAGAAG 60

ACTACAGTCG TGATGAGGGT AAGGGGTGAG GGCTCTGGGC TCAGCCTCCC GGCCTCCCGG 120

CCTGCCTGCC CCCAACCTCC TCTTTTGCCC ACAGACCACA TCGCTCACAT AGTGGAGCTT 180

CTGGGGGACA TCCCCCCAGC 200 

( 2 ) INFORMATION FOR SEQ ID NO:27: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 217 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Skeletal Muscle
( B ) CLONE: 120376

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:27:

CATTACAAGT AGCTTGGTTG TAGTGGAAAA AAACGAGAGA TTAACCATTC CAAGCAGTTG 60

CCCCAGAAGT TTTGCTGAAC TTTACATCAG TTTGGGAAGC TGATGCCAAG AAACGGCCAT 120

CATTCAAGCA AATCATTTCA ATCCTGGGTC CATGTCAAAT GACACGAGCC TTCCTGCAAG 180

TGTAACTCAT TCCTACACAA CAAGGCGGAG TGGAGGT 217

( 2 ) INFORMATION FOR SEQ ID NO:28:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 156 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Bone Marrow 
( B ) CLONE: 132750

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:28:

GTAGATTTGA CTCTGTTGTT TTCTCTCGTA GTTCCCAAAC TCATGGAAGT CTGTTTTTAT 60

CAATATGATG TAAAGTCTGA AATATACAGC TTTGGAATCG TCCTCTGGGA AATCGCCACT 120

GGAGATATCC CGTTTCAAGG CTGTAATTCT GAGAAG 156 

( 2 ) INFORMATION FOR SEQ ID NO:29: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 224 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: T Lymphocyte
( B ) CLONE: 140052

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:29:

TGTAAATAAG GCCCTTCTCC ACTTGACTTC AGGCAGCAGA TTGTCTAGAA GCCTAAGGAC 60

AGCAATTTCT CTGACAAGAC AAAGTAGATA TTTTATACCA GGGGTTGGCA AACTACTGCC 120

CACGGGCCGA ATTTGGCCCA GTCTGTTTTT GTATGGTGCA AACTAAAAAT GATTTTTACA 180

TTTTTAAAGA GTTATAAAAG AAAAAAATAT GTGGTCTGTG AAAT 224

( 2 ) INFORMATION FOR SEQ ID NO:30: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 198 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: T Lymphocyte 
( B ) CLONE: 146392 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:30: 

TTTTCTTTGT GTTTTTTTTT GTTCCAGTTT ATTTTAAATG CATATTTTAG TTGATTGCTT 60 

TTTTAAAAAG CCCCCTCTGG CCTCCTGATT CCAGCTAGTG TCAGCAGTGG GATACCTGCG 120

CTTGAAGGAC ATCATCCACC GTGACATCAA GGATGAGAAC ATCGTGATCG CCGAGGACTT 180 

CACAATCAAG CTGATAGT 198 

( 2 ) INFORMATION FOR SEQ ID NO:31: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 210 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: THP-1 Phorbol LPS
( B ) CLONE: 156108

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:31:

TGAAAACTAT GAACCTGGAC AAAAATCAAG GGCCAGTATC AAGCACGATA TATATAGCTA 60 

TGCAGTTATC ACATGCGAAG TGTTATCCAG AAAACAGCCT TTTGAAGATG TCACCAATCC 120

TTTGCACATA ATGTATAGTG TGTCACAAGG ACATCCACCT GTTATTAATG AAGAAACTTT 180

GCCATATGAT ATACCTCACC GAGCACGTAT 210

( 2 ) INFORMATION FOR SEQ ID NO:32: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 202 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Bone Marrow
( B ) CLONE: 173627

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:32: 

AGAAGATCGG GGCCGGCTTC TTCTCTGAGG TCTACAAGGT TCGGCACCGA CAGTCAGGGC 60

AAGTATGGTG CTGAAGATGA ACAAGCTCCC CAGTAACCGG GGCAACACAC TACGGGAAGT 120

GCAGCTGATG AACCGGCTCA GGCACCCCAA CATCCTAAGG TTCATGGGAG TCTGTGTGCA 180

CCAGGGACAG CTGCACGCTC TT 202

( 2 ) INFORMATION FOR SEQ ID NO:33: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 222 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta
( B ) CLONE: 181971

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:33: 

CGTTTTTGGA GGGTTCACAC CTGTCCCTTT CAAATGCTGG CGCTTTCACA CACTCCTTCT 60 

CTCCTGCCAG CACCTTCTGG TCTCAGGAGC ATTGCAGGAT GTTGTGTGAG TAAGTATGGG 120 

AGACACTTTA GTATGGCTTT TTTCAGCTTA GCCTCCTGTT ATCAGAGAGC AGTCTCTTTC 180 

AGTGTCAACG TTTGAGTACT AGATGGTGGA GAAAGCCTGT TT 222

( 2 ) INFORMATION FOR SEQ ID NO:34: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 192 base pairs 
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Placenta
( B ) CLONE: 182538

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:34:

CTTGGGGTGG TAAAACTTGG AGATCTTGGG CTTGGCCGGT TTTTCAGCTC AAAAACCACA 60 

GCTGCACATT CTTTAGTTGG TACGCCTTAT TACATGTCTC CAGAGAGAAT ACATGAAAAT 120

GGATACAACT TCAAATCTGA CATCTGGTCT CTTGGCTGTC TACTATATGA GATGGCTGCA 180

TTACAAAGTC CT 192

( 2 ) INFORMATION FOR SEQ ID NO:35:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 152 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Cardiac Muscle 
( B ) CLONE: 184416 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:35: 

CTATCGAAGG CCGCTGGCAG GGCAATGACA TTGTCGTGAA GGTGCTGAAG GTTCGAGACT 60

GGAGTACAAG GAAGAGCAGG GACTTCAATG AAGAGTGTCC CCGGCTCAGG ATTTTTCGCA 120

TCCAAATGTG CTCCCAGTGC TAGGTGCCTG CC 152 

( 2 ) INFORMATION FOR SEQ ID NO:36:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 152 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Rheumatoid Synovium 
( B ) CLONE: 191283 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:36:

CAACTACAGT GAACCTAAAA TGCCTCTAAT ACCTTTGCAA TTATCTTTAA GAGGATATCT 60

TATGAGTGAA ATTAACTTGT GCAACTACTT TCCTATTCAC TTTTTTACAG AGACTTAAAA 120

CCAGAGAATA TTTCTAGATT CACAGGGACA CT 152

( 2 ) INFORMATION FOR SEQ ID NO:37: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 199 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Rheumatoid Synovium 
( B ) CLONE: 192268 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:37: 

AGTGGACTGC AGTAAGCAGA GCTTCCTGAC CGAGGTGGAG CAGCTGTCCA GGTTTCGTCA 60 

CCCAAACATT GTGGACTTTC TGGCTACTGT GCTCAGAACG GCTTCTACTG CCTGGTGTAC 120 

GGCTTCCTGC CCAACGGCTC CCTGGAGGAC CGTTCCACTG CCAGACCCAG GCCTGCCCAC 180 
CTCTCTCCTG GCCTCAGCG 199 

( 2 ) INFORMATION FOR SEQ ID NO:38: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 189 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Stomach
( B ) CLONE: 214915

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:38:

AGAACATCCA GTACCTGGTC TATCAATGCT CAAACGCCTT AACTACATCC ACTCTCTGGG 60

GTCGTGCACA GGCACCTGAA GCCAGGCAAC CTGGCTGTGA ATAGGACTGT AACTGAAGAT 120

TCTGGATTTT GGGCTGGCGC GACATGCAGA CGCCGAGATG ACTGGCTACG TGGTGACCCG 180

CTGGTACCT 189

( 2 ) INFORMATION FOR SEQ ID NO:39: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 167 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Pancreas 
( B ) CLONE: 223163 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:39: 

CTTGCTCTTC TGACAGGATG AGAGTTATTA TAAGCAAATC CTACCTAGAG GCTTTTAACT 60 

CTAATGGGAA TAACTTGCAA CTAAAAGACC CAACTTGCAG ACCAAAATTA TCAAATGTTG 120

TGGATTTTCT GTCCCTCTTA ATGGATGTGG TACAATCAGA AAGGTAG 167 

( 2 ) INFORMATION FOR SEQ ID NO:40: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 197 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Small Intestine 
( B ) CLONE: 237002 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:40:

CCCAAACCTG CCCAGCCAGC CCTGAAAATG CAAGTTTTGT ACGATTTTGA AGCTAGGAAC 60

CCACGGGAAC TGACTGTGCT CCAGGGAGAG AAGCTGGAGG TTTGGACCAC AGCAAGCGGT 120

GGTGGCTGGT GAAGAATAGG CGGGACGGAG CGGCTACATT CCAAGCAACA TCTGGGCCCC 180 

TACAGCCGGG GACCCCG 197 

( 2 ) INFORMATION FOR SEQ ID NO:41:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 207 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Hippocampus 
( B ) CLONE: 239990 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:41:

CCAAGATGCT CGAGCAACTC AAGCCGAGAC TTGTACCAAG CAGACATGAG CAGGAAGGAG 60

GCAGAGGGCT CTGAGAAAGA CGGGACTTCC TGGTCAGGAA CAGCACCACC AACCCGGGCT 120

CCTTTTCCTC ACGGGCATGC ACAATGGCCA GGCAAGCACC TGCTGCTCTT GGACCCAGAA 180

GGCACGTCCG GACAAAGGCA GAGTCTT 207

( 2 ) INFORMATION FOR SEQ ID NO:42: 

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 195 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Hippocampus 
( B ) CLONE: 240142 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:42: 

GTCACCGGAG AGGATCCATG AGAACGGCTA CAACTTCAAG TCCGACATCT GGTCCTTGGG 60 

CTGTCTGCTG TACGAGATGG CAGCCCTCCA GAGCCCCTTC TATGGAGATA AGATGAATCT 120 

TTCTCCCTGT GCCAGAAGAT CGAGCAGTGT GACTACCCCC CACTCCCCGG GGAGCACTAC 180

TCCGAGAAGT TACGT 195 

( 2 ) INFORMATION FOR SEQ ID NO:43: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 213 base pairs
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA 

( v i i ) IMMEDIATE SOURCE: 

( A ) LIBRARY: Testes 
( B ) CLONE: 275781 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:43: 

CTCGTCTATT CGGCACGAGT TTCATTGTCG AAGGAAATAT AAACTGTCTG GAAGATCTGG 60

TGTAGCTCCT TCGAGACATC TTTGGCGATC AGCATCACCA ACGGTAAGAA GTGTAGTAAG 120

CCAGATCTCA GGGCCAGGCA TCCCCAGTTG CTGTACAAGA GCAGGCTTTC AAGATGCTTC 180

AAGGTCCCTG TCCATCAATA TGCTACACAT TTG 213

( 2 ) INFORMATION FOR SEQ ID NO:44:

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 425 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Eosinophils
( B ) CLONE: 285465

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:44:

AAATACTTGA AGGAGTTTAT TATCTACATC AGAATAACAT TGTACACCTT GATTTAAAGC 60 

CACAGAATAT ATTACTGAGC AGCATATACC CTCTCGGGCA CATTAAAATA GTAGATTTTG 120 

GAATCTCTCG AAAAATAGGG CATGCGTGTG AACTTCGGGA AATCATGGGA ACACCAGAAT 180

ATTTAGCTCC AGAAATCCTG AACTATGATC CCATTACCAC AGCAACAGAT ATGTGGAATA 240

TTGCTATAAT AGCATATATG TTGTTAACTC ACACATCACC ATTTGTGGGA GAAGATAATC 300

AAGAAACATA CCTCAATATC TCTCAAGTTA ATGTAGATTA TTCGGAAGGA ACTTTTTCAT 360

CAGTTTCACA GCTGGCACAG ACTTTATTCA GAGCTTTTAG TAAAATCAGA GGAAAGGCCC 420

ACAGC 425



( 2 ) INFORMATION FOR SEQ ID NO:45:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 1851 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: single
( D ) TOPOLOGY: linear

( i i ) MOLECULE TYPE: cDNA

( v i i ) IMMEDIATE SOURCE:

( A ) LIBRARY: Stomach
( B ) CLONE: 214915E

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:45:

GCCCGTTGGG CCGCGAACGC AGCCGCCACG CCGGGGCCGC CGAGATCGGG TGCCCGGGAT 60

GAGCCTCATC CGGAAAAAGG GCTTCTACAA GCAGGACGTC AACAAGACCG CCTGGGAGCT 120

GCCCAAGACC TACGTGTCCC CGACGCACGT CGGCAGCGGG GCCTATGGCT CCGTGTGCTC 180

GGCCATCGAC AAGCGGTCAG GGGAGAAGGT GGCCATCAAG AAGCTGAGCC GACCCTTTCA 240

GTCCGAGATC TTCGCCAAGC GCGCCTACCG GGAGCTGCTG TTGCTGAAGC ACATGCAGCA 300

TGAGAACGTC ATTGGGCTCC TGGATGTCTT CACCCCAGCC TCCTCCCTGG AACTTCTATG 360

ACTTCTACCT GGTGATGCCC TTCATGCAGA CCGATCTGCA GAAGATCATG GGGATGGAGT 420

TCAGTGAGGA GAAGATCCAG TACCTGGTGT ATCAGATGCT CAAAGGCCTT AAGTACATCC 480

ACTCTGCTGG GGTCGTGCAC AGGGACCTGA AGCCAGGCAA CCTGGCTGTG AATGAGGACT 540

GTGAACTGAA GATTCTGGAT TTGGGGCTGG CGCGACATGC AGACGCCGAG ATGACTGGCT 600

ACGTGGTGAC CCGCTGGTAC CGAGCCCCCG AGGTGATCCT CAGCTGGATG CACTACAACC 660

AGACAGTGGA CATCTGGTCT GTGGGCTGTA TCATGGCAGA GATGCTGACA GGGAAAACTC 720

TGTTCAAGGG GAAAGATTAC CTGGACCAGC TGACCCAGAT CCTGAAAGTG ACCGGGGTGC 780

CTGGCACGGA GTTTGTGCAG AAGCTGAACG ACAAAGCGGC CAAATCCTAC ATCCAGTCCC 840

TGCCACAGAC CCCCAGGAAG GATTTCACTC AGCTGTTCCC ACGGGCCAGC CCCCAGCCTG 900

CGGACCTGCT GGAGAAGATG CTGGAGCTAG ACGTGGACAA GCGCCTGACG GCCGCGCAGG 960

CCCTCACCCA TCCCTTCTTT GAACCCTTCC GGGACCCTGA GGAAGAGACG GAGGCCCAGC 1020

AGCCGTTTGA TGATTCCTTA GAACACGAGA AACTCACAGT GGATGAATGG AAGCAGCACA 1080

TCTACAAGGA GATTGTGAAC TTCAGCCCCA TTGCCCGGAA GGACTCACGG CGCCGGAGTG 1140

GCATGAAGCT GTAGGGACTC ATCTTGCATG GCACCGCCGG CCAGACACTG CCCAAGGACC 1200

AGTATTTGTC ACTACCAAAC TCAGCCCTTC TTGGAATACA GCCTTTCAAG CAGAGGACAG 1260

AAGGGTCCTT CTCCTTATGT GGGAAATGGG CCTAGTAGAT GCAGAATTCA AAGATGTCGG 1320

TTGGGAGAAA CTAGCTCTGA TCCTAACAGG CCACGTTAAA CTGCCCATCT GCAGAATCGC 1380

CTGCAGGTGG GGCCCTTTCC TTCCCGCCAG AGTGGGGCTG AGTGGGCGCT GAGCCAGGCC 1440 

GGGGGCCTAT GGCAGTGATG CTGTGTTGGT TTCCTAGGGA TGCTCTAACG AATTACCACA 1500 

AACCTGGTGG ATTGAAACAG CAGAACTTGA TTCCCTTACA GTTCTGGAGG CTGGAAATCT 1560

CCGATCCAGG TGTTGGCAGG GCTGTGGTCC CTTTGAAGGC TCTGGGGAAG AATCCTTCCT 1620

TGGCTCTTTT TAGCTTGTGG CGGCAGTGGG CAGTCCGTGG CATTCCCCAG CTTATTGCTG 1680

CATCACTCCA GTCTCTGTCT CTTCTGTTCT CTCCTCTTTT AACAACAGTC ATTGGATTTA 1740

CCGCCCACCC TAATCCTGTG TGATCTTATC TTGATCCTTA TTAATTAAAC CTGCAAATAC 1800

TCTAGTTCCA AATAAAGTCA CATTCTCAGG TAAAAAAAAA AAAAAAAAAA A 1851



We claim: 

1. A purified polynucleotide having a nucleic acid
sequence selected from the group consisting of SEQ ID
NO:1, SEQ ID NO:2, SEQ ID NO:3, SEQ ID NO:4, SEQ
ID NO:5, SEQ ID NO:6, SEQ ID NO:7, SEQ ID NO:8, SEQ
ID NO:9, SEQ ID NO:10, SEQ ID NO:11, SEQ ID NO:12,
SEQ ID NO:13, SEQ ID NO:14, SEQ ID NO:15, SEQ ID
NO:16, SEQ ID NO:17, SEQ ID NO:18, SEQ ID NO:19,
SEQ ID NO:20, SEQ ID NO:21, SEQ ID NO:22, SEQ ID
NO:23, SEQ ID NO:24, SEQ ID NO:25, SEQ ID NO:26,
SEQ ID NO:27, SEQ ID NO:28, SEQ ID NO:29, SEQ ID
NO:30, SEQ ID NO:31, SEQ ID NO:32, SEQ ID NO:33,
SEQ ID NO:34, SEQ ID NO:35, SEQ ID NO:36, SEQ ID
NO:37, SEQ ID NO:38, SEQ ID NO:39, SEQ ID NO:40,
SEQ ID NO:41, SEQ ID NO:42, SEQ ID NO:43, and SEQ
ID NO:44.

2. An expression vector comprising the polynucleotide of 
claim 1. 

3. A host cell transformed with the expression vector of 
claim 2. 

4. A method for producing and purifying a polypeptide, 
said method comprising the steps of: 

a) culturing the host cell of claim 3 under conditions 
suitable for the expression of the peptide; and 

b) recovering the polypeptide from the host cell culture. 

* * * * *

SECRETED PROTEINS AND
POLYNUCLEOTIDES ENCODING THEM 

FIELD OF THE INVENTION 

The present invention provides novel polynucleotides and 
proteins encoded by such polynucleotides, along with 
therapeutic, diagnostic and research utilities for these poly- 
nucleotides and proteins. 

BACKGROUND OF THE INVENTION 

Technology aimed at the discovery of protein factors 
(including e.g., cytokines, such as lymphokines, interferons, 
CSFs and interleukins) has matured rapidly over the past 
decade. The now routine hybridization cloning and expres- 
sion cloning techniques clone novel polynucleotides 
"directly" in the sense that they rely on information directly 
related to the discovered protein (ie., partial DNA/amino 
acid sequence of the protein in the case of hybridization 
cloning; activity of the protein in the case of expression 



cloning; activity ot toe proiein in uie case ox c^«^u ^ of: 
cloning). More recent "indirect" cloning techniques such as the ami 



(k) a polynucleotide which encodes a species homologue 

of the protein of (h) or (i) above. 
Preferably, such polynucleotide comprises the nucleotide 
sequence of SEQ ID NO:l from nucleotide 247 to nucle- 
5 otide 432; the nucleotide sequence of SEQ ID NO:l from 
nucleotide 328 to nucleotide 432; the nucleotide sequence of 
the full length protein coding sequence of clone BD372_^ 
deposited under accession number ATCC 98146; or the 
nucleotide sequence of the mature protein coding sequence 
io of clone BD372_Ji deposited under accession number 
ATCC 98146. In other preferred embodiments, the poly- 
nucleotide encodes the full length or mature protein encoded 
by the cDNA insert of clone BD372_5 deposited under 
accession number ATCC 98146. 
15 Other embodiments provide the gene corresponding to the 
cDNA sequence of SEQ ID NO:l or SEQ ID N03. 

Id other embodiments, the present invention provides a 
composition comprising a protein, wherein said protein 
comprises an amino acid sequence selected from the group 



signal sequence cloning, which isolates DNA sequences 
based on the presence of a now well-recognized secretory 
leader sequence motif, as well as various PCR-based or low 
stringency hybridization cloning techniques, have advanced 
the state of the art by making available large numbers of 
DNA/amino acid sequences for proteins that are known to 
have biological activity by virtue of their secreted nature in 
the case of leader sequence cloning, or by virtue of the cell 
or tissue source in the case of PCR-based techniques. It is to 
these proteins and the polynucleotides encoding them that 
the present invention is directed. 

SUMMARY OF THE INVENTION 



25 



30 



35 



45 



50 



In one embodiment, the present invention provides a 
composition comprising an isolated polynucleotide selected 
from the group consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:l; 

(b) a polynucleotide comprising the nucleotide sequence 40 
of SEQ ID NO:l from nucleotide 247 to nucleotide 
432; 

(c) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:l from nucleotide 328 to nucleotide 
432; 

(d) a polynucleotide comprising the nucleotide sequence 
of the full length protein coding sequence of clone 
BD372_^5 deposited under accession number ATCC 
98146; 

(e) a polynucleotide encoding the full length protein 
encodedby the cDNAihsert of doneBD372_5 depos- 
ited under accession number ATCC 98146; 

(f) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
BD372_5 deposited under accession number ATCC 
98146; 

(g) a polynucleotide encoding the mature protein encoded 
by the cDNA insert of clone BD372_^ deposited under 
accession number ATCC 98146; 

(h) a polynucleotide encoding a protein comp ri sing the 
amino acid sequence of SEQ ID NO:2; 

(i) a polynucleotide encoding a protein comprising a 
fragment of the amino acid sequence of SEQ ID N02 
having biological activity; 

(j) a polynucleotide which is an allelic variant of a 
polynucleotide of (aHg) above; 



(a) the amino acid sequence of SEQ ID NO:2; 

(b) fragments of the amino acid sequence of SEQ ID 
NO:2; and 

(c) the amino acid sequence encoded by the cDNA insert
of clone BD372_5 deposited under accession number ATCC
98146; the protein being substantially free from other mam- 
malian proteins. Preferably such protein comprises the 
amino acid sequence of SEQ ID NO:2. 

In one embodiment, the present invention provides a 
composition comprising an isolated polynucleotide selected 
from the group consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:4; 

(b) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:4 from nucleotide 316 to nucleotide 
501; 

(c) a polynucleotide comprising the nucleotide sequence
of the full length protein coding sequence of clone
BR533_4 deposited under accession number ATCC
98146;

(d) a polynucleotide encoding the full length protein 
encoded by the cDNA insert of clone BR533_4 depos- 
ited under accession number ATCC 98146; 

(e) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
BR533_4 deposited under accession number ATCC 
98146; 

(f) a polynucleotide encoding the mature protein encoded
by the cDNA insert of clone BR533_4 deposited under
accession number ATCC 98146;

(g) a polynucleotide encoding a protein comprising the 
amino acid sequence of SEQ ID NO:5; 

(h) a polynucleotide encoding a protein comprising a 
fragment of the amino acid sequence of SEQ ID NO:5 
having biological activity; 

(i) a polynucleotide which is an allelic variant of a
polynucleotide of (a)-(f) above;

(j) a polynucleotide which encodes a species homologue
of the protein of (g) or (h) above.
Preferably, such polynucleotide comprises the nucleotide
sequence of SEQ ID NO:4 from nucleotide 316 to nucleotide
501; the nucleotide sequence of the full length protein
coding sequence of clone BR533_4 deposited under accession
number ATCC 98146; or the nucleotide sequence of the
mature protein coding sequence of clone BR533_4 deposited
under accession number ATCC 98146. In other preferred
embodiments, the polynucleotide encodes the full length or
mature protein encoded by the cDNA insert of clone BR533_4
deposited under accession number ATCC 98146.

Other embodiments provide the gene corresponding to the
cDNA sequence of SEQ ID NO:4 or SEQ ID NO:6.

In other embodiments, the present invention provides a
composition comprising a protein, wherein said protein
comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:5;

(b) fragments of the amino acid sequence of SEQ ID
NO:5; and

(c) the amino acid sequence encoded by the cDNA insert
of clone BR533_4 deposited under accession number ATCC
98146; the protein being substantially free from other
mammalian proteins. Preferably such protein comprises the
amino acid sequence of SEQ ID NO:5.

In one embodiment, the present invention provides a
composition comprising an isolated polynucleotide selected
from the group consisting of:

(a) a polynucleotide comprising the nucleotide sequence
of SEQ ID NO:7;

(b) a polynucleotide comprising the nucleotide sequence
of SEQ ID NO:7 from nucleotide 113 to nucleotide
433;

(c) a polynucleotide comprising the nucleotide sequence
of the full length protein coding sequence of clone
CC288_9 deposited under accession number ATCC
98146;

(d) a polynucleotide encoding the full length protein
encoded by the cDNA insert of clone CC288_9 deposited
under accession number ATCC 98146;

(e) a polynucleotide comprising the nucleotide sequence
of the mature protein coding sequence of clone
CC288_9 deposited under accession number ATCC
98146;

(f) a polynucleotide encoding the mature protein encoded
by the cDNA insert of clone CC288_9 deposited under
accession number ATCC 98146;

(g) a polynucleotide encoding a protein comprising the
amino acid sequence of SEQ ID NO:8;

(h) a polynucleotide encoding a protein comprising a
fragment of the amino acid sequence of SEQ ID NO:8
having biological activity;

(i) a polynucleotide which is an allelic variant of a
polynucleotide of (a)-(f) above;

(j) a polynucleotide which encodes a species homologue
of the protein of (g) or (h) above.

Preferably, such polynucleotide comprises the nucleotide
sequence of SEQ ID NO:7 from nucleotide 113 to nucleotide
433; the nucleotide sequence of the full length protein
coding sequence of clone CC288_9 deposited under accession
number ATCC 98146; or the nucleotide sequence of the
mature protein coding sequence of clone CC288_9 deposited
under accession number ATCC 98146. In other preferred
embodiments, the polynucleotide encodes the full length or
mature protein encoded by the cDNA insert of clone CC288_9
deposited under accession number ATCC 98146. In yet other
preferred embodiments, the present invention provides a
polynucleotide encoding a protein comprising the amino acid
sequence of SEQ ID NO:8 from amino acid 1 to amino acid 77.

Other embodiments provide the gene corresponding to the
cDNA sequence of SEQ ID NO:7.

In other embodiments, the present invention provides a
composition comprising a protein, wherein said protein
comprises an amino acid sequence selected from the group
consisting of:

(a) the amino acid sequence of SEQ ID NO:8;

(b) the amino acid sequence of SEQ ID NO:8 from amino
acid 1 to amino acid 77;

(c) fragments of the amino acid sequence of SEQ ID
NO:8; and

(d) the amino acid sequence encoded by the cDNA insert
of clone CC288_9 deposited under accession number ATCC
98146; the protein being substantially free from other
mammalian proteins. Preferably such protein comprises the
amino acid sequence of SEQ ID NO:8 or the amino acid
sequence of SEQ ID NO:8 from amino acid 1 to amino acid
77.

In certain preferred embodiments, the polynucleotide is
operably linked to an expression control sequence. The
invention also provides a host cell, including bacterial,
yeast, insect and mammalian cells, transformed with such
polynucleotide compositions.

Processes are also provided for producing a protein,
which comprise:

(a) growing a culture of the host cell transformed with
such polynucleotide compositions in a suitable culture
medium; and

(b) purifying the protein from the culture.

The protein produced according to such methods is also
provided by the present invention. Preferred embodiments
include those in which the protein produced by such process
is a mature form of the protein.

Protein compositions of the present invention may further
comprise a pharmaceutically acceptable carrier. Compositions
comprising an antibody which specifically reacts with
such protein are also provided by the present invention.

Methods are also provided for preventing, treating or
ameliorating a medical condition which comprises
administering to a mammalian subject a therapeutically effective
amount of a composition comprising a protein of the present
invention and a pharmaceutically acceptable carrier.

DETAILED DESCRIPTION

ISOLATED PROTEINS AND POLYNUCLEOTIDES

Nucleotide and amino acid sequences are reported below
for each clone and protein disclosed in the present
application. In some instances the sequences are preliminary and
may include some incorrect or ambiguous bases or amino
acids. The actual nucleotide sequence of each clone can
readily be determined by sequencing of the deposited clone
in accordance with known methods. The predicted amino
acid sequence (both full length and mature) can then be
determined from such nucleotide sequence. The amino acid
sequence of the protein encoded by a particular clone can
also be determined by expression of the clone in a suitable
host cell, collecting the protein and determining its
sequence.

For each disclosed protein applicants have identified what
they have determined to be the reading frame best
identifiable with sequence information available at the time of
filing. Because of the partial ambiguity in reported sequence
information, reported protein sequences include "Xaa"
designators. These "Xaa" designators indicate either (1) a
residue which cannot be identified because of nucleotide
sequence ambiguity or (2) a stop codon in the determined
nucleotide sequence where applicants believe one should not
exist (if the nucleotide sequence were determined more
accurately).

As used herein a "secreted" protein is one which, when
expressed in a suitable host cell, is transported across or
through a membrane, including transport as a result of signal
sequences in its amino acid sequence. "Secreted" proteins
include without limitation proteins secreted wholly (e.g.,
soluble proteins) or partially (e.g., receptors) from the cell in
which they are expressed. "Secreted" proteins also include
without limitation proteins which are transported across the
membrane of the endoplasmic reticulum.

Clone "BD372_5"

A polynucleotide of the present invention has been
identified as clone "BD372_5". BD372_5 was isolated from a
human fetal kidney cDNA library using methods which are
selective for cDNAs encoding secreted proteins. BD372_5
is a full-length clone, including the entire coding sequence
of a secreted protein (also referred to herein as "BD372_5
protein").

The nucleotide sequence of the 5' portion of BD372_5 as
presently determined is reported in SEQ ID NO:1. What
applicants presently believe is the proper reading frame for
the coding region is indicated in SEQ ID NO:2. The
predicted amino acid sequence of the BD372_5 protein
corresponding to the foregoing nucleotide sequence is reported
in SEQ ID NO:2. Amino acids 1 to 27 are the predicted
leader/signal sequence, with the predicted mature amino acid
sequence beginning at amino acid 28. Additional nucleotide
sequence from the 3' portion of BD372_5, including the
polyA tail, is reported in SEQ ID NO:3.

The EcoRI/NotI restriction fragment obtainable from the
deposit containing clone BD372_5 should be approximately
2300 bp.

The nucleotide sequence disclosed herein for BD372_5
was searched against the GenBank database using BLASTA/
BLASTX and FASTA search protocols. BD372_5 demonstrated
at least some identity with ESTs identified as
"yc90f12.s1 Homo sapiens cDNA clone 23278 3'" (R39276,
BlastN) and "EST05537 Homo sapiens cDNA clone
HFBEM26" (T07647, Fasta). Based upon identity,
BD372_5 proteins and each identical protein or peptide
may share at least some activity.

Clone "BR533_4"

A polynucleotide of the present invention has been
identified as clone "BR533_4". BR533_4 was isolated from a
human fetal kidney cDNA library using methods which are
selective for cDNAs encoding secreted proteins. BR533_4
is a full-length clone, including the entire coding sequence
of a secreted protein (also referred to herein as "BR533_4
protein").

The nucleotide sequence of the 5' portion of BR533_4 as
presently determined is reported in SEQ ID NO:4. What
applicants presently believe is the proper reading frame for
the coding region is indicated in SEQ ID NO:5. The
predicted amino acid sequence of the BR533_4 protein
corresponding to the foregoing nucleotide sequence is reported
in SEQ ID NO:5. Additional nucleotide sequence from the
3' portion of BR533_4, including the polyA tail, is reported
in SEQ ID NO:6.

The EcoRI/NotI restriction fragment obtainable from the
deposit containing clone BR533_4 should be approximately
2850 bp.

The nucleotide sequence disclosed herein for BR533_4
was searched against the GenBank database using BLASTA/
BLASTX and FASTA search protocols. BR533_4 demonstrated
at least some homology with murine semaphorin E
(X85994, BlastN). BR533_4 also shows at least some
identity with an EST identified as "yy80d10.s1 Homo
sapiens cDNA clone 279859 3'" (N38844, BlastN). Based
upon homology, BR533_4 proteins and each homologous
protein or peptide may share at least some activity.

Clone "CC288_9"

A polynucleotide of the present invention has been
identified as clone "CC288_9". CC288_9 was isolated from a
human adult brain cDNA library using methods which are
selective for cDNAs encoding secreted proteins. CC288_9
is a full-length clone, including the entire coding sequence
of a secreted protein (also referred to herein as "CC288_9
protein").

The nucleotide sequence of CC288_9 as presently
determined is reported in SEQ ID NO:7. What applicants
presently believe to be the proper reading frame and the
predicted amino acid sequence of the CC288_9 protein
corresponding to the foregoing nucleotide sequence is
reported in SEQ ID NO:8.

The nucleotide sequence disclosed herein for CC288_9
was searched against the GenBank database using BLASTA/
BLASTX and FASTA search protocols. No hits were found
in the database.

Clones BD372_5, BR533_4 and CC288_9 were deposited
on [...] 1996 with the American Type Culture
Collection under accession number ATCC 98146, from
which each clone comprising a particular polynucleotide is
obtainable. Each clone has been transfected into separate
bacterial cells (E. coli) in this composite deposit. Each clone
can be removed from the vector in which it was deposited by
performing an EcoRI/NotI digestion (5' site, EcoRI; 3' site,
NotI) to produce the appropriately sized fragment for such
clone (approximate clone size fragment are identified
below). Bacterial cells containing a particular clone can be
obtained from the composite deposit as follows:

An oligonucleotide probe or probes should be designed to
the sequence that is known for that particular clone. This
sequence can be derived from the sequences provided
herein, or from a combination of those sequences. The
sequence of the oligonucleotide probe that was used to
isolate each full-length clone is identified below, and should
be most reliable in isolating the clone of interest.

Clone        Probe Sequence
BD372_5      SEQ ID NO: 9
BR533_4      SEQ ID NO: 10
CC288_9      SEQ ID NO: 11

In the sequences listed above which include an N at position
2, that position is occupied in preferred probes/primers by a
biotinylated phosphoramidite residue rather than a
nucleotide (such as [...] phosphoramidite
(1-[...]-aminobutyl)-[...]-phosphoramidite) (Glen Research,
[...])).

The design of the oligonucleotide probe should preferably
follow these parameters:

(a) It should be designed to an area of the sequence which
has the fewest ambiguous bases ("N's"), if any;

(b) It should be designed to have a Tm of approx. 80° C.
(assuming 2° for each A or T and 4 degrees for each G
or C).

The oligonucleotide should preferably be labeled with
γ-32P ATP (specific activity 6000 Ci/mmole) and T4
polynucleotide kinase using commonly employed techniques for
labeling oligonucleotides. Other labeling techniques can
also be used. Unincorporated label should preferably be
removed by gel filtration chromatography or other
established methods. The amount of radioactivity incorporated
into the probe should be quantitated by measurement in a
scintillation counter. Preferably, specific activity of the
resulting probe should be approximately 4e+6 dpm/pmole.
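The probe-design parameters stated above (fewest ambiguous "N" bases, and a Tm near 80° C. counted as 2° per A or T and 4° per G or C) can be illustrated with a short sketch. The function names, the window length and the example sequence below are hypothetical and chosen only for illustration; they are not taken from the sequence listing.

```python
# Minimal sketch of the probe-design rule of thumb described in the text.
# All names and the example sequence are hypothetical.

def rule_of_thumb_tm(probe: str) -> int:
    """Estimate Tm in degrees C: 2 per A or T, 4 per G or C.

    Ambiguous bases such as N contribute nothing to the estimate.
    """
    seq = probe.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def best_probe_window(sequence: str, length: int, target_tm: int = 80):
    """Return (start, window) for the window with the fewest N's,
    breaking ties by closeness of the estimated Tm to target_tm,
    then by earliest position."""
    seq = sequence.upper()
    if length <= 0 or length > len(seq):
        raise ValueError("window length must be between 1 and len(sequence)")
    candidates = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        n_count = window.count("N")
        tm_gap = abs(rule_of_thumb_tm(window) - target_tm)
        candidates.append((n_count, tm_gap, start, window))
    _, _, start, window = min(candidates)
    return start, window


# Hypothetical 5' sequence containing a few ambiguous bases.
example = "NNNNATGCATGCATGC"
start, window = best_probe_window(example, length=4, target_tm=12)
print(start, window, rule_of_thumb_tm(window))
```

The estimate is only the counting rule given in the text; it does not account for salt concentration or probe length, so it should be read as a rough design aid rather than a measured melting temperature.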

The bacterial culture containing the pool of full-length
clones should preferably be thawed and 100 μl of the stock
used to inoculate a sterile culture flask containing 25 ml of
sterile L-broth containing ampicillin at 100 μg/ml. The
culture should preferably be grown to saturation at 37° C.,
and the saturated culture should preferably be diluted in
fresh L-broth. Aliquots of these dilutions should preferably
be plated to determine the dilution and volume which will
yield approximately 5000 distinct and well-separated
colonies on solid bacteriological media containing L-broth
containing ampicillin at 100 μg/ml and agar at 1.5% in a 150
mm petri dish when grown overnight at 37° C. Other known
methods of obtaining distinct, well-separated colonies can
also be employed.
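The step of plating trial dilutions to find the dilution and volume that yield about 5000 colonies per dish amounts to a simple titer calculation, sketched below. The trial count, dilution factors and function names are hypothetical values used only to show the arithmetic.

```python
# Minimal sketch of the titer arithmetic behind the plating step.
# The numbers in the example are hypothetical.

def titer_cfu_per_ml(colonies: int, dilution_factor: float, volume_ml: float) -> float:
    """Colony-forming units per ml of the undiluted saturated culture."""
    if volume_ml <= 0 or dilution_factor <= 0:
        raise ValueError("volume and dilution factor must be positive")
    return colonies * dilution_factor / volume_ml


def volume_for_target_ml(titer: float, dilution_factor: float, target_colonies: int = 5000) -> float:
    """Volume (ml) of a given dilution expected to yield target_colonies."""
    if titer <= 0:
        raise ValueError("titer must be positive")
    return target_colonies * dilution_factor / titer


# Hypothetical trial plate: 250 colonies from 0.1 ml of a 1e5-fold dilution.
titer = titer_cfu_per_ml(250, 1e5, 0.1)
# Volume of a 1e4-fold dilution expected to give about 5000 colonies.
print(titer, volume_for_target_ml(titer, 1e4))
```

In practice several dilutions would be plated, since colony counts far from the countable range give unreliable titers.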

Standard colony hybridization procedures should then be 
used to transfer the colonies to nitrocellulose filters and lyse, 
denature and bake them.

The filter is then preferably incubated at 65° C. for 1 hour
with gentle agitation in 6x SSC (20x stock is 175.3 g
NaCl/liter, 88.2 g Na citrate/liter, adjusted to pH 7.0 with
NaOH) containing 0.5% SDS, 100 μg/ml of yeast RNA, and
10 mM EDTA (approximately 10 mL per 150 mm filter).
Preferably, the probe is then added to the hybridization mix
at a concentration greater than or equal to 1e+6 dpm/mL.
The filter is then preferably incubated at 65° C. with gentle
agitation overnight. The filter is then preferably washed in
500 mL of 2x SSC/0.5% SDS at room temperature without
agitation, preferably followed by 500 mL of 2x SSC/0.1%
SDS at room temperature with gentle shaking for 15
minutes. A third wash with 0.1x SSC/0.5% SDS at 65° C. for 30
minutes to 1 hour is optional. The filter is then preferably
dried and subjected to autoradiography for sufficient time to
visualize the positives on the X-ray film. Other known
hybridization methods can also be employed.
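The SSC concentrations used in the hybridization and wash steps above follow from the 20x stock recipe by ordinary dilution arithmetic, sketched below. The molecular weights are assumptions not stated in the text (58.44 g/mol for NaCl, and 294.10 g/mol for sodium citrate on the assumption that the trisodium dihydrate salt is meant); the function names are hypothetical.

```python
# Minimal sketch: molarity of the 20x SSC stock and C1*V1 = C2*V2 dilutions.
# Molecular weights are assumptions (see lead-in), not values from the text.

NACL_G_PER_L = 175.3       # from the stated 20x recipe
CITRATE_G_PER_L = 88.2     # from the stated 20x recipe
NACL_MW = 58.44            # g/mol, assumed
CITRATE_MW = 294.10        # g/mol, assumes trisodium citrate dihydrate


def molarity(grams_per_liter: float, molecular_weight: float) -> float:
    """Molar concentration from mass concentration."""
    return grams_per_liter / molecular_weight


def stock_volume_ml(final_x: float, final_volume_ml: float, stock_x: float = 20.0) -> float:
    """Volume of stock needed so that final_volume_ml is at final_x strength."""
    if final_x > stock_x:
        raise ValueError("cannot dilute to a strength above the stock")
    return final_x * final_volume_ml / stock_x


print(molarity(NACL_G_PER_L, NACL_MW), molarity(CITRATE_G_PER_L, CITRATE_MW))
print(stock_volume_ml(6, 10))     # 6x SSC prehybridization mix, 10 mL per filter
print(stock_volume_ml(2, 500))    # 2x SSC wash, 500 mL
print(stock_volume_ml(0.1, 500))  # optional 0.1x SSC wash, 500 mL
```

Under these assumptions the 20x stock works out to roughly 3 M NaCl and 0.3 M citrate, so 1x SSC is about 0.15 M NaCl and 0.015 M citrate.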

The positive colonies are picked, grown in culture, and
plasmid DNA isolated using standard procedures. The
clones can then be verified by restriction analysis,
hybridization analysis, or DNA sequencing.

Fragments of the proteins of the present invention which
are capable of exhibiting biological activity are also
encompassed by the present invention. Fragments of the protein
may be in linear form or they may be cyclized using known
methods, for example, as described in H. U. Saragovi, et al.,
Bio/Technology 10, 773-778 (1992) and in R. S. McDowell,
et al., J. Amer. Chem. Soc. 114, 9245-9253 (1992), both of
which are incorporated herein by reference. Such fragments
may be fused to carrier molecules such as immunoglobulins
for many purposes, including increasing the valency of
protein binding sites. For example, fragments of the protein
may be fused through "linker" sequences to the Fc portion
of an immunoglobulin. For a bivalent form of the protein,
such a fusion could be to the Fc portion of an IgG molecule.
Other immunoglobulin isotypes may also be used to
generate such fusions. For example, a protein-IgM fusion would
generate a decavalent form of the protein of the invention.

The present invention also provides both full-length and
mature forms of the disclosed proteins. The full-length form
of such proteins is identified in the sequence listing by
translation of the nucleotide sequence of each disclosed
clone. The mature form of such protein may be obtained by
expression of the disclosed full-length polynucleotide
(preferably those deposited with ATCC) in a suitable
mammalian cell or other host cell. The sequence of the mature
form of the protein may also be determinable from the amino
acid sequence of the full-length form.
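Where a leader/signal sequence has been predicted, as for BD372_5 (amino acids 1 to 27 predicted as leader, mature sequence beginning at amino acid 28), the mature sequence can be read off the full-length sequence by removing the leader. The sketch below shows that step; the placeholder sequence and the function name are hypothetical and do not reproduce SEQ ID NO:2.

```python
# Minimal sketch: derive a predicted mature sequence from a full-length
# sequence and a predicted leader length. The sequence is a placeholder.

def mature_form(full_length: str, leader_length: int) -> str:
    """Return the residues following the predicted leader/signal sequence.

    leader_length is the number of leader residues; with a 27-residue
    leader the mature form begins at residue 28 (1-based numbering).
    """
    if not 0 <= leader_length < len(full_length):
        raise ValueError("leader length must be shorter than the protein")
    return full_length[leader_length:]


# Hypothetical full-length protein: 27 leader residues, then 3 mature residues.
placeholder = "M" + "A" * 26 + "QRS"
print(mature_form(placeholder, 27))
```

The leader length itself is a prediction, so the result is only as reliable as that prediction; expression in a suitable host cell, as described above, remains the direct way to obtain the mature form.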

The present invention also provides genes corresponding 
to the cDNA sequences disclosed herein. The corresponding 
genes can be isolated in accordance with known methods 
using the sequence information disclosed herein. Such meth- 
ods include the preparation of probes or primers from the 
disclosed sequence information for identification and/or 
amplification of genes in appropriate genomic libraries or 
other sources of genomic materials. 

Where the protein of the present invention is membrane- 
bound (e.g., is a receptor), the present invention also pro- 
vides for soluble forms of such protein. In such forms part 
or all of the intracellular and transmembrane domains of the 
protein are deleted such that the protein is fully secreted 
from the cell in which it is expressed. The intracellular and 
transmembrane domains of proteins of the invention can be 
identified in accordance with known techniques for deter- 
mination of such domains from sequence information. 

Species homologs of the disclosed polynucleotides and 
proteins are also provided by the present invention. Species 
homologs may be isolated and identified by making suitable 
probes or primers from the sequences provided herein and 
screening a suitable nucleic acid source from the desired 
species. 

The invention also encompasses allelic variants of the 
disclosed polynucleotides or proteins; that is, naturally- 
occurring alternative forms of the isolated polynucleotide 
which also encode proteins which are identical, homologous 
or related to that encoded by the polynucleotides. 

The isolated polynucleotide of the invention may be
operably linked to an expression control sequence such as
the pMT2 or pED expression vectors disclosed in Kaufman
et al., Nucleic Acids Res. 19, 4485-4490 (1991), in order to
produce the protein recombinantly. Many suitable
expression control sequences are known in the art. General
methods of expressing recombinant proteins are also known and
are exemplified in R. Kaufman, Methods in Enzymology
185, 537-566 (1990). As defined herein "operably linked"
means that the isolated polynucleotide of the invention and
an expression control sequence are situated within a vector
or cell in such a way that the protein is expressed by a host
cell which has been transformed (transfected) with the
ligated polynucleotide/expression control sequence.

A number of types of cells may act as suitable host cells 
for expression of the protein. Mammalian host cells include, 
for example, monkey COS cells, Chinese Hamster Ovary 
(CHO) cells, human kidney 293 cells, human epidermal 
A431 cells, human Colo205 cells, 3T3 cells, CV-1 cells, 
other transformed primate cell lines, normal diploid cells, 
cell strains derived from in vitro culture of primary tissue, 
primary explants, HeLa cells, mouse L cells, BHK, HL-60, 
U937, HaK or Jurkat cells. 

Alternatively, it may be possible to produce the protein in
lower eukaryotes such as yeast or in prokaryotes such as
bacteria. Potentially suitable yeast strains include
Saccharomyces cerevisiae, Schizosaccharomyces pombe,
Kluyveromyces strains, Candida, or any yeast strain capable
of expressing heterologous proteins. Potentially suitable
bacterial strains include Escherichia coli, Bacillus subtilis,
Salmonella typhimurium, or any bacterial strain capable of
expressing heterologous proteins. If the protein is made in
yeast or bacteria, it may be necessary to modify the protein
produced therein, for example by phosphorylation or
glycosylation of the appropriate sites, in order to obtain the
functional protein. Such covalent attachments may be
accomplished using known chemical or enzymatic methods.

The protein may also be produced by operably linking the
isolated polynucleotide of the invention to suitable control
sequences in one or more insect expression vectors, and
employing an insect expression system. Materials and
methods for baculovirus/insect cell expression systems are
commercially available in kit form from, e.g., Invitrogen, San
Diego, Calif., U.S.A. (the MaxBac® kit), and such methods
are well known in the art, as described in Summers and
Smith, Texas Agricultural Experiment Station Bulletin No.
1555 (1987), incorporated herein by reference. As used
herein, an insect cell capable of expressing a polynucleotide
of the present invention is "transformed."

The protein of the invention may be prepared by culturing
transformed host cells under culture conditions suitable to
express the recombinant protein. The resulting expressed
protein may then be purified from such culture (i.e., from
culture medium or cell extracts) using known purification
processes, such as gel filtration and ion exchange
chromatography. The purification of the protein may also include an
affinity column containing agents which will bind to the
protein; one or more column steps over such affinity resins
as concanavalin A-agarose, heparin-toyopearl® or
Cibacrom blue 3GA Sepharose®; one or more steps involving
hydrophobic interaction chromatography using such resins
as phenyl ether, butyl ether, or propyl ether; or
immunoaffinity chromatography.

Alternatively, the protein of the invention may also be
expressed in a form which will facilitate purification. For
example, it may be expressed as a fusion protein, such as
those of maltose binding protein (MBP), glutathione-S-
transferase (GST) or thioredoxin (TRX). Kits for expression
and purification of such fusion proteins are commercially
available from New England BioLab (Beverly, Mass.),
Pharmacia (Piscataway, N.J.) and Invitrogen, respectively. The
protein can also be tagged with an epitope and subsequently
purified by using a specific antibody directed to such
epitope. One such epitope ("Flag") is commercially
available from Kodak (New Haven, Conn.).

Finally, one or more reverse-phase high performance
liquid chromatography (RP-HPLC) steps employing
hydrophobic RP-HPLC media, e.g., silica gel having pendant
methyl or other aliphatic groups, can be employed to further
purify the protein. Some or all of the foregoing purification
steps, in various combinations, can also be employed to
provide a substantially homogeneous isolated recombinant
protein. The protein thus purified is substantially free of
other mammalian proteins and is defined in accordance with
the present invention as an "isolated protein."

The protein of the invention may also be expressed as a
product of transgenic animals, e.g., as a component of the
milk of transgenic cows, goats, pigs, or sheep which are
characterized by somatic or germ cells containing a
nucleotide sequence encoding the protein.

The protein may also be produced by known conventional
chemical synthesis. Methods for constructing the proteins of
the present invention by synthetic means are known to those
skilled in the art. The synthetically-constructed protein
sequences, by virtue of sharing primary, secondary or
tertiary structural and/or conformational characteristics with
proteins may possess biological properties in common
therewith, including protein activity. Thus, they may be
employed as biologically active or immunological
substitutes for natural, purified proteins in screening of therapeutic
compounds and in immunological processes for the
development of antibodies.

The proteins provided herein also include proteins
characterized by amino acid sequences similar to those of
purified proteins but into which modifications are naturally
provided or deliberately engineered. For example,
modifications in the peptide or DNA sequences can be made by
those skilled in the art using known techniques.
Modifications of interest in the protein sequences may include the
alteration, substitution, replacement, insertion or deletion of
a selected amino acid residue in the coding sequence. For
example, one or more of the cysteine residues may be
deleted or replaced with another amino acid to alter the
conformation of the molecule. Techniques for such
alteration, substitution, replacement, insertion or deletion
are well known to those skilled in the art (see, e.g., U.S. Pat.
No. 4,518,584). Preferably, such alteration, substitution,
replacement, insertion or deletion retains the desired activity
of the protein.

Other fragments and derivatives of the sequences of
proteins which would be expected to retain protein activity
in whole or in part and may thus be useful for screening or
other immunological methodologies may also be easily
made by those skilled in the art given the disclosures herein.
Such modifications are believed to be encompassed by the
present invention.

USES and BIOLOGICAL ACTIVITY

The polynucleotides and proteins of the present invention
are expected to exhibit one or more of the uses or biological
activities (including those associated with assays cited
herein) identified below. Uses or activities described for
proteins of the present invention may be provided by
administration or use of such proteins or by administration or use
of polynucleotides encoding such proteins (such as, for
example, in gene therapies or vectors suitable for
introduction of DNA).

Research Uses and Utilities

The polynucleotides provided by the present invention
can be used by the research community for various purposes.
The polynucleotides can be used to express recombinant
protein for analysis, characterization or therapeutic use; as
markers for tissues in which the corresponding protein is
preferentially expressed (either constitutively or at a
particular stage of tissue differentiation or development or in
disease states); as molecular weight markers on Southern
gels; as chromosome markers or tags (when labeled) to
identify chromosomes or to map related gene positions; to
compare with endogenous DNA sequences in patients to
identify potential genetic disorders; as probes to hybridize
and thus discover novel, related DNA sequences; as a source
of information to derive PCR primers for genetic
fingerprinting; as a probe to "subtract-out" known sequences in
the process of discovering other novel polynucleotides; for
selecting and making oligomers for attachment to a "gene
chip" or other support, including for examination of
expression patterns; to raise anti-protein antibodies using DNA
immunization techniques; and as an antigen to raise
anti-DNA antibodies or elicit another immune response. Where
the polynucleotide encodes a protein which binds or
potentially binds to another protein (such as, for example, in a
receptor-ligand interaction), the polynucleotide can also be
used in interaction trap assays (such as, for example, that
described in Gyuris et al., Cell 75:791-803 (1993)) to
identify polynucleotides encoding the other protein with
which binding occurs or to identify inhibitors of the binding
interaction.
The proteins provided by the present invention can
similarly be used in assay to determine biological activity,
including in a panel of multiple proteins for high-throughput
screening; to raise antibodies or to elicit another immune
response; as a reagent (including the labeled reagent) in
assays designed to quantitatively determine levels of the
protein (or its receptor) in biological fluids; as markers for
tissues in which the corresponding protein is preferentially
expressed (either constitutively or at a particular stage of
tissue differentiation or development or in a disease state);
and, of course, to isolate correlative receptors or ligands.
Where the protein binds or potentially binds to another
protein (such as, for example, in a receptor-ligand
interaction), the protein can be used to identify the other
protein with which binding occurs or to identify inhibitors of
the binding interaction. Proteins involved in these binding
interactions can also be used to screen for peptide or small
molecule inhibitors or agonists of the binding interaction.

Any or all of these research utilities are capable of being
developed into reagent grade or kit format for
commercialization as research products.

Methods for performing the uses listed above are well
known to those skilled in the art. References disclosing such
methods include without limitation "Molecular Cloning: A
Laboratory Manual", 2d ed., Cold Spring Harbor Laboratory
Press, Sambrook, J., E. F. Fritsch and T. Maniatis eds., 1989,
and "Methods in Enzymology: Guide to Molecular Cloning
Techniques", Academic Press, Berger, S. L. and A. R.
Kimmel eds., 1987.

Nutritional Uses

Polynucleotides and proteins of the present invention can
also be used as nutritional sources or supplements. Such uses
include without limitation use as a protein or amino acid
supplement, use as a carbon source, use as a nitrogen source
and use as a source of carbohydrate. In such cases the protein
or polynucleotide of the invention can be added to the feed
of a particular organism or can be administered as a separate
solid or liquid preparation, such as in the form of powder,

nol. 145:1706-1712, 1990; Bertagnolli et al., Cellular
Immunology 133:327-341, 1991; Bertagnolli, et al., J.
Immunol. 149:3778-3783, 1992; Bowman et al., J.
Immunol. 152:1756-1761, 1994.

Assays for cytokine production and/or proliferation of
spleen cells, lymph node cells or thymocytes include,
without limitation, those described in: Polyclonal T cell
stimulation, Kruisbeek, A. M. and Shevach, E. M. In
Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol 1
pp. 3.12.1-3.12.14, John Wiley and Sons, Toronto. 1994;
and Measurement of mouse and human interleukin γ,
Schreiber, R. D. In Current Protocols in Immunology. J. E.
e.a. Coligan eds. Vol 1 pp. 6.8.1-6.8.8, John Wiley and Sons,
Toronto. 1994.

Assays for proliferation and differentiation of
hematopoietic and lymphopoietic cells include, without limitation,
those described in: Measurement of Human and Murine
Interleukin 2 and Interleukin 4, Bottomly, K., Davis, L. S.
and Lipsky, P. E. In Current Protocols in Immunology. J. E.
e.a. Coligan eds. Vol 1 pp. 6.3.1-6.3.12, John Wiley and
Sons, Toronto. 1991; deVries et al., J. Exp. Med.
173:1205-1211, 1991; Moreau et al., Nature 336:690-692,
1988; Greenberger et al., Proc. Natl. Acad. Sci. U.S.A.
80:2931-2938, 1983; Measurement of mouse and human
interleukin 6 - Nordan, R. In Current Protocols in
Immunology. J. E. e.a. Coligan eds. Vol 1 pp. 6.6.1-6.6.5, John
Wiley and Sons, Toronto. 1991; Smith et al., Proc. Natl.
Acad. Sci. U.S.A. 83:1857-1861, 1986; Measurement of
human Interleukin 11 - Bennett, F., Giannotti, J., Clark, S.
C. and Turner, K. J. In Current Protocols in Immunology. J.
E. e.a. Coligan eds. Vol 1 pp. 6.15.1 John Wiley and Sons,
Toronto. 1991; Measurement of mouse and human
Interleukin 9 - Ciarletta, A., Giannotti, J., Clark, S. C. and Turner,
K. J. In Current Protocols in Immunology. J. E. e.a. Coligan
eds. Vol 1 pp. 6.13.1, John Wiley and Sons, Toronto. 1991.

Assays for T-cell clone responses to antigens (which will
identify, among others, proteins that affect APC-T cell
interactions as well as direct T-cell effects by measuring

pills, solutions, suspensions or capsules. In the case of proliferation and cytokine production) include, without 

noimorgamsnis, me protdn or polynucleotide of theinven- 40 limitation, those described in: Current Protocols in 

tion can be added to the medium in or on which the Immunology, Ed by J. E. Coligan, A- M. Kruisbeek, D. H. 

microorganism is cultured. Margulies, E. M. Shevach, W Strober, Pub. Greene Publish- 

Cytokine and Cell IVoliferation/Differentiation Activity ing Associates and Wiley-Interscience (Chapter 3, In Vitro 

A protein of the present invention may exhibit cytokine, assays for Mouse Lymphocyte Function; Chapter 6, Cytok- 

cell proliferation (either inducing or inhibiting) or cell 45 ines and their cellular receptors; Chapter 7, Immunologic 

differentiation (either inducing or inhibiting) activity or may studies in Humans); Weinberger et al., Proc NatL Acad. Sci 

induce production of other cytokines in certain cell popu- USA 77:6091-6095, 1980; Weinberger et al., Eur. J. Immun. 

lations. Many protein factors discovered to date, including 11:405-^11, 1981;Takai et aL, J. ImmunoL 137:3494-3500, 

all known cytokines, have exhibited activity in one or more 1986; Takai et al., J. ImmunoL 140:508-512, 1988. 

factor dependent cell proliferation assays, and hence the 50 Immune Stimulating or Suppressing Activity 

assays serve as a convenient confirmation of cytokine activ- A protein of the present invention may also exhibit 

ity. The activity of a protein of the present invention is irnmune stimulating or immune suppressing activity, includ- 

evidenced by any one of a number of routine factor depen- ing without limitation the activities for which assays are 

dent cell proliferation assays for cell lines including, without described herein, Aprotein may be useful in the treatment of 

limitation, 32D, DA2, DA1G, T10, B9, B9/11, BaF3, MC9/ 55 various irnmune deficiencies and disorders (including severe 

G, M-KpreB M+), 2E8, RB5, DAI, 123, T1165, HT2, conibmed immunodeficiency (SOD)), e.g., in regulating (up 

CIIX2, TF-1, Mo7e and CMK. or down) growth and proliferation of T and/or B 

The activity of a protein of the invention may, among lymphocytes, as well as effecting the cytolytic activity of 

other means, be measured by the following methods: NK cells and other cell populations. These immu ne defi- 

Assays for T-cell or thymocyte proliferation include with- 60 ciencies may be genetic or be caused by vital (eg., HIV) as 

out limitation those described in: Current Protocols in well as bacterial or fungal infections, or may result from 

Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. autoimmune disorders. More specifically, infectious dis- 

Marguiies, E. M. Shevach, W. Strober, Pub. Greene Pub- eases causes by viral, bacterial, fungal or other infection 

lishing Associates and Wiley-Interscience (Chapter 3, In may be treatable using a protein of the present invention. 

Vitro assays for Mouse Lymphocyte Function 3.1-3.19; 65 including infections by HIV, hepatitis vinises, herpesviruses, 

Chapter 7, Immunologic studies in Humans); Takai et aL, J. mycobacteria, Leishmania spp., malaria spp. and various 

ImmunoL 137:3494-3500, 1986;BertagnolHetaL,J.Immu- fungal infections such as candidiasis. Of course, in this 
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regard, a protein of the present Invention may also be useful 
where a boost to the immune system generally may be 
desirable, i.e., in the treatment of cancer. 

Autoimmune disorders which may be treated using a
protein of the present invention include, for example, con-
nective tissue disease, multiple sclerosis, systemic lupus
erythematosus, rheumatoid arthritis, autoimmune pulmo-
nary inflammation, Guillain-Barre syndrome, autoimmune
thyroiditis, insulin dependent diabetes mellitus, myasthenia
gravis, graft-versus-host disease and autoimmune inflam-
matory eye disease. Such a protein of the present invention
may also be useful in the treatment of allergic reactions
and conditions, such as asthma (particularly allergic asthma)
or other respiratory problems. Other conditions, in which
immune suppression is desired (including, for example,
organ transplantation), may also be treatable using a protein
of the present invention.

Using the proteins of the invention it may also be possible
to regulate immune responses in a number of ways. Down regulation
may be in the form of inhibiting or blocking an immune
response already in progress or may involve preventing the
induction of an immune response. The functions of activated
T cells may be inhibited by suppressing T cell responses or
by inducing specific tolerance in T cells, or both. Immuno-
suppression of T cell responses is generally an active,
non-antigen-specific, process which requires continuous
exposure of the T cells to the suppressive agent. Tolerance,
which involves inducing non-responsiveness or anergy in T
cells, is distinguishable from immunosuppression in that it is
generally antigen-specific and persists after exposure to the
tolerizing agent has ceased. Operationally, tolerance can be
demonstrated by the lack of a T cell response upon reexpo-
sure to specific antigen in the absence of the tolerizing agent.

Down regulating or preventing one or more antigen
functions (including without limitation B lymphocyte anti-
gen functions (such as, for example, B7)), e.g., preventing
high level lymphokine synthesis by activated T cells, will be
useful in situations of tissue, skin and organ transplantation
and in graft-versus-host disease (GVHD). For example,
blockage of T cell function should result in reduced tissue
destruction in tissue transplantation. Typically, in tissue
transplants, rejection of the transplant is initiated through its
recognition as foreign by T cells, followed by an immune
reaction that destroys the transplant. The administration of a
molecule which inhibits or blocks interaction of a B7
lymphocyte antigen with its natural ligand(s) on immune
cells (such as a soluble, monomeric form of a peptide having
B7-2 activity alone or in conjunction with a monomeric
form of a peptide having an activity of another B lympho-
cyte antigen (e.g., B7-1, B7-3) or blocking antibody), prior
to transplantation can lead to the binding of the molecule to
the natural ligand(s) on the immune cells without transmit-
ting the corresponding costimulatory signal. Blocking B
lymphocyte antigen function in this manner prevents cytokine
synthesis by immune cells, such as T cells, and thus acts as
an immunosuppressant. Moreover, the lack of costimulation
may also be sufficient to anergize the T cells, thereby
inducing tolerance in a subject. Induction of long-term
tolerance by B lymphocyte antigen-blocking reagents may
avoid the necessity of repeated administration of these
blocking reagents. To achieve sufficient immunosuppression
or tolerance in a subject, it may also be necessary to block
the function of a combination of B lymphocyte antigens.

The efficacy of particular blocking reagents in preventing
organ transplant rejection or GVHD can be assessed using
animal models that are predictive of efficacy in humans.
Examples of appropriate systems which can be used include
allogeneic cardiac grafts in rats and xenogeneic pancreatic
islet cell grafts in mice, both of which have been used to
examine the immunosuppressive effects of CTLA4Ig fusion
proteins in vivo as described in Lenschow et al., Science
257:789-792 (1992) and Turka et al., Proc. Natl. Acad. Sci.
USA, 89:11102-11105 (1992). In addition, murine models
of GVHD (see Paul ed., Fundamental Immunology, Raven
Press, New York, 1989, pp. 846-847) can be used to
determine the effect of blocking B lymphocyte antigen
function in vivo on the development of that disease.

Blocking antigen function may also be therapeutically 
useful for treating autoimmune diseases. Many autoimmune 
disorders are the result of inappropriate activation of T cells 
that are reactive against self tissue and which promote the 
production of cytokines and autoantibodies involved in the 
pathology of the diseases. Preventing the activation of 
autoreactive T cells may reduce or eliminate disease symp- 
toms. Administration of reagents which block costimulation 
of T cells by disrupting receptor:ligand interactions of B
lymphocyte antigens can be used to inhibit T cell activation 
and prevent production of autoantibodies or T cell-derived 
cytokines which may be involved in the disease process. 
Additionally, blocking reagents may induce antigen-specific 
tolerance of autoreactive T cells which could lead to long- 
term relief from the disease. The efficacy of blocking 
reagents in preventing or alleviating autoimmune disorders 
can be determined using a number of well-characterized 
animal models of human autoimmune diseases. Examples 
include murine experimental autoimmune encephalitis, sys-
temic lupus erythematosus in MRL/lpr/lpr mice or NZB
hybrid mice, murine autoimmune collagen arthritis, diabetes 
mellitus in NOD mice and BB rats, and murine experimental 
myasthenia gravis (see Paul ed., Fundamental Immunology, 
Raven Press, New York, 1989, pp. 840-856). 

Upregulation of an antigen function (preferably a B 
lymphocyte antigen function), as a means of up regulating 
immune responses, may also be useful in therapy. Upregu- 
lation of immune responses may be in the form of enhancing
an existing immune response or eliciting an initial immune 
response. For example, enhancing an immune response 
through stimulating B lymphocyte antigen function may be 
useful in cases of viral infection. In addition, systemic viral 
diseases such as influenza, the common cold, and encepha- 
litis might be alleviated by the administration of stimulatory 
forms of B lymphocyte antigens systemically. 

Alternatively, anti-viral immune responses may be
enhanced in an infected patient by removing T cells from the
patient, costimulating the T cells in vitro with viral antigen-
pulsed APCs either expressing a peptide of the present
invention or together with a stimulatory form of a soluble
peptide of the present invention and reintroducing the in
vitro activated T cells into the patient. Another method of
enhancing anti-viral immune responses would be to isolate
infected cells from a patient, transfect them with a nucleic
acid encoding a protein of the present invention as described
herein such that the cells express all or a portion of the
protein on their surface, and reintroduce the transfected cells
into the patient. The infected cells would now be capable of
delivering a costimulatory signal to, and thereby activating, T
cells in vivo.

In another application, up regulation or enhancement of 
antigen function (preferably B lymphocyte antigen function) 
may be useful in the induction of tumor immunity. Tumor 
cells (e.g., sarcoma, melanoma, lymphoma, leukemia, 
neuroblastoma, carcinoma) transfected with a nucleic acid 
encoding at least one peptide of the present invention can be 
administered to a subject to overcome tumor-specific toler-
ance in the subject. If desired, the tumor cell can be
transfected to express a combination of peptides. For
example, tumor cells obtained from a patient can be trans-
fected ex vivo with an expression vector directing the
expression of a peptide having B7-2-like activity alone, or in
conjunction with a peptide having B7-1-like activity and/or
B7-3-like activity. The transfected tumor cells are returned
to the patient to result in expression of the peptides on the
surface of the transfected cell. Alternatively, gene therapy
techniques can be used to target a tumor cell for transfection
in vivo.

The presence of the peptide of the present invention
having the activity of a B lymphocyte antigen(s) on the
surface of the tumor cell provides the necessary costimula-
tion signal to T cells to induce a T cell mediated immune
response against the transfected tumor cells. In addition,
tumor cells which lack MHC class I or MHC class II
molecules, or which fail to reexpress sufficient amounts of
MHC class I or MHC class II molecules, can be transfected
with nucleic acid encoding all or a portion of (e.g., a
cytoplasmic-domain truncated portion) of an MHC class I α
chain protein and β2 microglobulin protein or an MHC class
II α chain protein and an MHC class II β chain protein to
thereby express MHC class I or MHC class II proteins on the
cell surface. Expression of the appropriate class I or class II
MHC in conjunction with a peptide having the activity of a
B lymphocyte antigen (e.g., B7-1, B7-2, B7-3) induces a T
cell mediated immune response against the transfected
tumor cell. Optionally, a gene encoding an antisense con-
struct which blocks expression of an MHC class II associ-
ated protein, such as the invariant chain, can also be cotrans-
fected with a DNA encoding a peptide having the activity of
a B lymphocyte antigen to promote presentation of tumor
associated antigens and induce tumor specific immunity.
Thus, the induction of a T cell mediated immune response in
a human subject may be sufficient to overcome tumor-
specific tolerance in the subject.

The activity of a protein of the invention may, among
other means, be measured by the following methods:

Suitable assays for thymocyte or splenocyte cytotoxicity
include, without limitation, those described in: Current
Protocols in Immunology, Ed by J. E. Coligan, A. M.
Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober,
Pub. Greene Publishing Associates and Wiley-Interscience
(Chapter 3, In Vitro assays for Mouse Lymphocyte Function
3.1-3.19; Chapter 7, Immunologic studies in Humans);
Herrmann et al., Proc. Natl. Acad. Sci. USA 78:2488-2492,
1981; Herrmann et al., J. Immunol. 128:1968-1974, 1982;
Handa et al., J. Immunol. 135:1564-1572, 1985; Takai et al.,
J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol.
140:508-512, 1988; Herrmann et al., Proc. Natl. Acad. Sci.
USA 78:2488-2492, 1981; Herrmann et al., J. Immunol.
128:1968-1974, 1982; Handa et al., J. Immunol.
135:1564-1572, 1985; Takai et al., J. Immunol.
137:3494-3500, 1986; Bowman et al., J. Virology
61:1992-1998; Takai et al., J. Immunol. 140:508-512, 1988;
Bertagnolli et al., Cellular Immunology 133:327-341, 1991;
Brown et al., J. Immunol. 153:3079-3092, 1994.

Assays for T-cell-dependent immunoglobulin responses
and isotype switching (which will identify, among others,
proteins that modulate T-cell dependent antibody responses
and that affect Th1/Th2 profiles) include, without limitation,
those described in: Maliszewski, J. Immunol.
144:3028-3033, 1990; and Assays for B cell function: In
vitro antibody production, Mond, J. J. and Brunswick, M. In
Current Protocols in Immunology. J. E. e.a. Coligan eds. Vol
1 pp. 3.8.1-3.8.16, John Wiley and Sons, Toronto. 1994.

Mixed lymphocyte reaction (MLR) assays (which will
identify, among others, proteins that generate predominantly
Th1 and CTL responses) include, without limitation, those
described in: Current Protocols in Immunology, Ed by J. E.
Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach,
W. Strober, Pub. Greene Publishing Associates and Wiley-
Interscience (Chapter 3, In Vitro assays for Mouse Lympho-
cyte Function 3.1-3.19; Chapter 7, Immunologic studies in
Humans); Takai et al., J. Immunol. 137:3494-3500, 1986;
Takai et al., J. Immunol. 140:508-512, 1988; Bertagnolli et
al., J. Immunol. 149:3778-3783, 1992.

Dendritic cell-dependent assays (which will identify,
among others, proteins expressed by dendritic cells that
activate naive T-cells) include, without limitation, those
described in: Guery et al., J. Immunol. 134:536-544, 1995;
Inaba et al., Journal of Experimental Medicine
173:549-559, 1991; Macatonia et al., Journal of Immunol-
ogy 154:5071-5079, 1995; Porgador et al., Journal of
Experimental Medicine 182:255-260, 1995; Nair et al.,
Journal of Virology 67:4062-4069, 1993; Huang et al.,
Science 264:961-965, 1994; Macatonia et al., Journal of
Experimental Medicine 169:1255-1264, 1989; Bhardwaj et
al., Journal of Clinical Investigation 94:797-807, 1994; and
Inaba et al., Journal of Experimental Medicine
172:631-640, 1990.

Assays for lymphocyte survival/apoptosis (which will
identify, among others, proteins that prevent apoptosis after
superantigen induction and proteins that regulate lympho-
cyte homeostasis) include, without limitation, those
described in: Darzynkiewicz et al., Cytometry 13:795-808,
1992; Gorczyca et al., Leukemia 7:659-670, 1993; Gorc-
zyca et al., Cancer Research 53:1945-1951, 1993; Itoh et al.,
Cell 66:233-243, 1991; Zacharchuk, Journal of Immunol-
ogy 145:4037-4045, 1990; Zamai et al., Cytometry
14:891-897, 1993; Gorczyca et al., International Journal of
Oncology 1:639-648, 1992.

Assays for proteins that influence early steps of T-cell
commitment and development include, without limitation,
those described in: Antica et al., Blood 84:111-117, 1994;
Fine et al., Cellular Immunology 155:111-122, 1994; Galy
et al., Blood 85:2770-2778, 1995; Toki et al., Proc. Nat.
Acad. Sci. USA 88:7548-7551, 1991.
Hematopoiesis Regulating Activity

A protein of the present invention may be useful in
regulation of hematopoiesis and, consequently, in the treat-
ment of myeloid or lymphoid cell deficiencies. Even mar-
ginal biological activity in support of colony forming cells
or of factor-dependent cell lines indicates involvement in
regulating hematopoiesis, e.g. in supporting the growth and
proliferation of erythroid progenitor cells alone or in com-
bination with other cytokines, thereby indicating utility, for
example, in treating various anemias or for use in conjunc-
tion with irradiation/chemotherapy to stimulate the produc-
tion of erythroid precursors and/or erythroid cells; in sup-
porting the growth and proliferation of myeloid cells such as
granulocytes and monocytes/macrophages (i.e., traditional
CSF activity) useful, for example, in conjunction with
chemotherapy to prevent or treat consequent myelo-
suppression; in supporting the growth and proliferation of
megakaryocytes and consequently of platelets thereby
allowing prevention or treatment of various platelet disor-
ders such as thrombocytopenia, and generally for use in
place of or complementary to platelet transfusions; and/or in
supporting the growth and proliferation of hematopoietic
stem cells which are capable of maturing to any and all of
the above-mentioned hematopoietic cells and therefore find
therapeutic utility in various stem cell disorders (such as
those usually treated with transplantation, including, without
limitation, aplastic anemia and paroxysmal nocturnal
hemoglobinuria), as well as in repopulating the stem cell
compartment post irradiation/chemotherapy, either in-vivo
or ex-vivo (i.e., in conjunction with bone marrow transplan-
tation or with peripheral progenitor cell transplantation
(homologous or heterologous)) as normal cells or geneti-
cally manipulated for gene therapy.

The activity of a protein of the invention may, among
other means, be measured by the following methods:

Suitable assays for proliferation and differentiation of
various hematopoietic lines are cited above.

Assays for embryonic stem cell differentiation (which will
identify, among others, proteins that influence embryonic
differentiation hematopoiesis) include, without limitation,
those described in: Johansson et al. Cellular Biology
15:141-151, 1995; Keller et al., Molecular and Cellular
Biology 13:473-486, 1993; McClanahan et al., Blood
81:2903-2915, 1993.

Assays for stem cell survival and differentiation (which
will identify, among others, proteins that regulate lympho-
hematopoiesis) include, without limitation, those described
in: Methylcellulose colony forming assays, Freshney, M. G.
In Culture of Hematopoietic Cells. R. I. Freshney, et al. eds.
Vol pp. 265-268, Wiley-Liss, Inc., New York, N.Y. 1994;
Hirayama et al., Proc. Natl. Acad. Sci. USA 89:5907-5911,
1992; Primitive hematopoietic colony forming cells with
high proliferative potential, McNiece, I. K. and Briddell, R.
A. In Culture of Hematopoietic Cells. R. I. Freshney, et al.
eds. Vol pp. 23-39, Wiley-Liss, Inc., New York, N.Y. 1994;
Neben et al., Experimental Hematology 22:353-359, 1994;
Cobblestone area forming cell assay, Ploemacher, R. E. In
Culture of Hematopoietic Cells. R. I. Freshney, et al. eds. Vol
pp. 1-21, Wiley-Liss, Inc., New York, N.Y. 1994; Long term
bone marrow cultures in the presence of stromal cells,
Spooncer, E., Dexter, M. and Allen, T. In Culture of Hemato-
poietic Cells. R. I. Freshney, et al. eds. Vol pp. 163-179,
Wiley-Liss, Inc., New York, N.Y. 1994; Long term culture
initiating cell assay, Sutherland, H. J. In Culture of Hemato-
poietic Cells. R. I. Freshney, et al. eds. Vol pp. 139-162,
Wiley-Liss, Inc., New York, N.Y. 1994.
Tissue Growth Activity

A protein of the present invention also may have utility in
compositions used for bone, cartilage, tendon, ligament
and/or nerve tissue growth or regeneration, as well as for
wound healing and tissue repair and replacement, and in the
treatment of burns, incisions and ulcers.

A protein of the present invention, which induces carti-
lage and/or bone growth in circumstances where bone is not
normally formed, has application in the healing of bone
fractures and cartilage damage or defects in humans and
other animals. Such a preparation employing a protein of the
invention may have prophylactic use in closed as well as
open fracture reduction and also in the improved fixation of
artificial joints. De novo bone formation induced by an
osteogenic agent contributes to the repair of congenital,
trauma induced, or oncologic resection induced craniofacial
defects, and also is useful in cosmetic plastic surgery.

A protein of this invention may also be used in the
treatment of periodontal disease, and in other tooth repair
processes. Such agents may provide an environment to
attract bone-forming cells, stimulate growth of bone-
forming cells or induce differentiation of progenitors of
bone-forming cells. A protein of the invention may also be
useful in the treatment of osteoporosis or osteoarthritis, such
as through stimulation of bone and/or cartilage repair or by
blocking inflammation or processes of tissue destruction
(collagenase activity, osteoclast activity, etc.) mediated by
inflammatory processes.

Another category of tissue regeneration activity that may
be attributable to the protein of the present invention is
tendon/ligament formation. A protein of the present
invention, which induces tendon/ligament-like tissue or
other tissue formation in circumstances where such tissue is
not normally formed, has application in the healing of
tendon or ligament tears, deformities and other tendon or
ligament defects in humans and other animals. Such a
preparation employing a tendon/ligament-like tissue induc-
ing protein may have prophylactic use in preventing dam-
age to tendon or ligament tissue, as well as use in the improved
fixation of tendon or ligament to bone or other tissues, and
in repairing defects to tendon or ligament tissue. De novo
tendon/ligament-like tissue formation induced by a compo-
sition of the present invention contributes to the repair of
congenital, trauma induced, or other tendon or ligament
defects of other origin, and is also useful in cosmetic plastic
surgery for attachment or repair of tendons or ligaments. The
compositions of the present invention may provide an environ-
ment to attract tendon- or ligament-forming cells, stimulate
growth of tendon- or ligament-forming cells, induce differ-
entiation of progenitors of tendon- or ligament-forming
cells, or induce growth of tendon/ligament cells or progeni-
tors ex vivo for return in vivo to effect tissue repair. The
compositions of the invention may also be useful in the
treatment of tendinitis, carpal tunnel syndrome and other
tendon or ligament defects. The compositions may also
include an appropriate matrix and/or sequestering agent as a
carrier as is well known in the art.

The protein of the present invention may also be useful for
proliferation of neural cells and for regeneration of nerve
and brain tissue, i.e. for the treatment of central and periph-
eral nervous system diseases and neuropathies, as well as
mechanical and traumatic disorders, which involve
degeneration, death or trauma to neural cells or nerve tissue.
More specifically, a protein may be used in the treatment of
diseases of the peripheral nervous system, such as peripheral
nerve injuries, peripheral neuropathy and localized
neuropathies, and central nervous system diseases, such as
Alzheimer's, Parkinson's disease, Huntington's disease,
amyotrophic lateral sclerosis, and Shy-Drager syndrome.
Further conditions which may be treated in accordance with
the present invention include mechanical and traumatic
disorders, such as spinal cord disorders, head trauma and
cerebrovascular diseases such as stroke. Peripheral neuro-
pathies resulting from chemotherapy or other medical thera-
pies may also be treatable using a protein of the invention.

Proteins of the invention may also be useful to promote
better or faster closure of non-healing wounds, including
without limitation pressure ulcers, ulcers associated with
vascular insufficiency, surgical and traumatic wounds, and
the like.

It is expected that a protein of the present invention may
also exhibit activity for generation or regeneration of other
tissues, such as organs (including, for example, pancreas,
liver, intestine, kidney, skin, endothelium), muscle (smooth,
skeletal or cardiac) and vascular (including vascular
endothelium) tissue, or for promoting the growth of cells
comprising such tissues. Part of the desired effects may be
by inhibition or modulation of fibrotic scarring to allow
normal tissue to regenerate. A protein of the invention may
also exhibit angiogenic activity.

A protein of the present invention may also be useful for
gut protection or regeneration and treatment of lung or liver
fibrosis, reperfusion injury in various tissues, and conditions
resulting from systemic cytokine damage.

A protein of the present invention may also be useful for
promoting or inhibiting differentiation of tissues described
above from precursor tissues or cells; or for inhibiting the
growth of tissues described above.

The activity of a protein of the invention may, among
other means, be measured by the following methods: 

Assays for tissue generation activity include, without
limitation, those described in: International Patent Publica-
tion No. WO95/16035 (bone, cartilage, tendon); Interna-
tional Patent Publication No. WO95/05846 (nerve,
neuronal); International Patent Publication No. WO91/07491
(skin, endothelium).

Assays for wound healing activity include, without
limitation, those described in: Winter, Epidermal Wound
Healing, pps. 71-112 (Maibach, H. I. and Rovee, D. T.,
eds.), Year Book Medical Publishers, Inc., Chicago, as
modified by Eaglstein and Mertz, J. Invest. Dermatol.
71:382-84 (1978).
Activin/Inhibin Activity 

A protein of the present invention may also exhibit
activin- or inhibin-related activities. Inhibins are character-
ized by their ability to inhibit the release of follicle stimu-
lating hormone (FSH), while activins are characterized
by their ability to stimulate the release of follicle stimulating
hormone (FSH). Thus, a protein of the present invention,
alone or in heterodimers with a member of the inhibin α
family, may be useful as a contraceptive based on the ability
of inhibins to decrease fertility in female mammals and
decrease spermatogenesis in male mammals. Administration
of sufficient amounts of other inhibins can induce infertility
in these mammals. Alternatively, the protein of the
invention, as a homodimer or as a heterodimer with other
protein subunits of the inhibin-β group, may be useful as a
fertility inducing therapeutic, based upon the ability of
activin molecules in stimulating FSH release from cells of
the anterior pituitary. See, for example, U.S. Pat. No.
4,798,885. A protein of the invention may also be useful for
advancement of the onset of fertility in sexually immature
mammals, so as to increase the lifetime reproductive per-
formance of domestic animals such as cows, sheep and pigs.

The activity of a protein of the invention may, among 
other means, be measured by the following methods: 

Assays for activin/inhibin activity include, without
limitation, those described in: Vale et al., Endocrinology
91:562-572, 1972; Ling et al., Nature 321:779-782, 1986;
Vale et al., Nature 321:776-779, 1986; Mason et al., Nature
318:659-663, 1985; Forage et al., Proc. Natl. Acad. Sci.
USA 83:3091-3095, 1986.
Chemotactic/Chemokinetic Activity 

A protein of the present invention may have chemotactic
or chemokinetic activity (e.g., act as a chemokine) for
mammalian cells, including, for example, monocytes,
fibroblasts, neutrophils, T-cells, mast cells, eosinophils, epi-
thelial and/or endothelial cells. Chemotactic and chemoki-
netic proteins can be used to mobilize or attract a desired cell
population to a desired site of action. Chemotactic or chemo-
kinetic proteins provide particular advantages in treatment
of wounds and other trauma to tissues, as well as in
treatment of localized infections. For example, attraction of
lymphocytes, monocytes or neutrophils to tumors or sites of
infection may result in improved immune responses against
the tumor or infecting agent.

A protein or peptide has chemotactic activity for a
particular cell population if it can stimulate, directly or
indirectly, the directed orientation or movement of such cell
population. Preferably, the protein or peptide has the ability
to directly stimulate directed movement of cells. Whether a
particular protein has chemotactic activity for a population
of cells can be readily determined by employing such
protein or peptide in any known assay for cell chemotaxis.

The activity of a protein of the invention may, among 
other means, be measured by the following methods: 

Assays for chemotactic activity (which will identify
proteins that induce or prevent chemotaxis) consist of assays
that measure the ability of a protein to induce the migration
of cells across a membrane as well as the ability of a protein
to induce the adhesion of one cell population to another cell
population. Suitable assays for movement and adhesion
include, without limitation, those described in: Current
Protocols in Immunology, Ed. by J. E. Coligan, A. M.
Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober,
Pub. Greene Publishing Associates and Wiley-Interscience
(Chapter 6.12, Measurement of alpha and beta Chemokines
6.12.1-6.12.28); Taub et al., J. Clin. Invest. 95:1370-1376,
1995; Lind et al., APMIS 103:140-146, 1995; Muller et al.,
Eur. J. Immunol. 25:1744-1748; Gruber et al., J. of Immunol.
152:5860-5867, 1994; Johnston et al., J. of Immunol.
153:1762-1768, 1994.
Hemostatic and Thrombolytic Activity 

A protein of the invention may also exhibit hemostatic or 
thrombolytic activity. As a result, such a protein is expected 
to be useful in treatment of various coagulation disorders 
(including hereditary disorders, such as hemophilias) or to 
enhance coagulation and other hemostatic events in treating 
wounds resulting from trauma, surgery or other causes. A 
protein of the invention may also be useful for dissolving or 
inhibiting formation of thromboses and for treatment and 
prevention of conditions resulting therefrom (such as, for
example, infarction of cardiac and central nervous system
vessels (e.g., stroke)).

The activity of a protein of the invention may, among 
other means, be measured by the following methods: 

Assays for hemostatic and thrombolytic activity include,
without limitation, those described in: Linet et al., J. Clin.
Pharmacol. 26:131-140, 1986; Burdick et al., Thrombosis
Res. 45:413-419, 1987; Humphrey et al., Fibrinolysis
5:71-79 (1991); Schaub, Prostaglandins 35:467-474, 1988.
Receptor/Ligand Activity 

A protein of the present invention may also demonstrate 
activity as receptors, receptor ligands or inhibitors or ago- 
nists of receptor/ligand interactions. Examples of such 
receptors and ligands include, without limitation, cytokine 
receptors and their ligands, receptor kinases and their 
ligands, receptor phosphatases and their ligands, receptors 
involved in cell-cell interactions and their ligands (including 
without limitation, cellular adhesion molecules (such as 
selectins, integrins and their ligands) and receptor/ligand 
pairs involved in antigen presentation, antigen recognition 
and development of cellular and humoral immune 
responses). Receptors and ligands are also useful for screen- 
ing of potential peptide or small molecule inhibitors of the 
relevant receptor/ligand interaction. A protein of the present
invention (including, without limitation, fragments of
receptors and ligands) may itself be useful as an inhibitor of
receptor/ligand interactions.

The activity of a protein of the invention may, among 
other means, be measured by the following methods: 

Suitable assays for receptor-ligand activity include
without limitation those described in: Current Protocols in
Immunology, Ed. by J. E. Coligan, A. M. Kruisbeek, D. H.
Margulies, E. M. Shevach, W. Strober, Pub. Greene
Publishing Associates and Wiley-Interscience (Chapter 7.28,
Measurement of Cellular Adhesion under static conditions
7.28.1-7.28.22), Takai et al., Proc. Natl. Acad. Sci. USA
84:6864-6868, 1987; Bierer et al., J. Exp. Med.
168:1145-1156, 1988; Rosenstein et al., J. Exp. Med.
169:149-160, 1989; Stoltenborg et al., J. Immunol. Methods
175:59-68, 1994; Stitt et al., Cell 80:661-670, 1995.
Anti-Inflammatory Activity

Proteins of the present invention may also exhibit
anti-inflammatory activity. The anti-inflammatory activity may
be achieved by providing a stimulus to cells involved in the
inflammatory response, by inhibiting or promoting cell-cell
interactions (such as, for example, cell adhesion), by
inhibiting or promoting chemotaxis of cells involved in the
inflammatory process, inhibiting or promoting cell
extravasation, or by stimulating or suppressing production
of other factors which more directly inhibit or promote an
inflammatory response. Proteins exhibiting such activities
can be used to treat inflammatory conditions (including
chronic or acute conditions), including without limitation
inflammation associated with infection (such as septic shock,
sepsis or systemic inflammatory response syndrome (SIRS)),
ischemia-reperfusion injury, endotoxin lethality, arthritis,
complement-mediated hyperacute rejection, nephritis,
cytokine or chemokine-induced lung injury, inflammatory
bowel disease, Crohn's disease or resulting from over
production of cytokines such as TNF or IL-1. Proteins of the
invention may also be useful to treat anaphylaxis and
hypersensitivity to an antigenic substance or material.
Tumor Inhibition Activity

In addition to the activities described above for
immunological treatment or prevention of tumors, a protein of
the invention may exhibit other anti-tumor activities. A protein
may inhibit tumor growth directly or indirectly (such as, for
example, via ADCC). A protein may exhibit its tumor
inhibitory activity by acting on tumor tissue or tumor
precursor tissue, by inhibiting formation of tissues necessary
to support tumor growth (such as, for example, by inhibiting
angiogenesis), by causing production of other factors, agents
or cell types which inhibit tumor growth, or by suppressing,
eliminating or inhibiting factors, agents or cell types which
promote tumor growth.
Other Activities

A protein of the invention may also exhibit one or more
of the following additional activities or effects: inhibiting the
growth, infection or function of, or killing, infectious agents,
including, without limitation, bacteria, viruses, fungi and
other parasites; effecting (suppressing or enhancing) bodily
characteristics, including, without limitation, height, weight,
hair color, eye color, skin, fat to lean ratio or other tissue
pigmentation, or organ or body part size or shape (such as,
for example, breast augmentation or diminution, change in
bone form or shape); effecting biorhythms or circadian
cycles or rhythms; effecting the fertility of male or female
subjects; effecting the metabolism, catabolism, anabolism,
processing, utilization, storage or elimination of dietary fat,
lipid, protein, carbohydrate, vitamins, minerals, cofactors or
other nutritional factors or component(s); effecting
behavioral characteristics, including, without limitation,
appetite, libido, stress, cognition (including cognitive
disorders), depression (including depressive disorders) and
violent behaviors; providing analgesic effects or other pain
reducing effects; promoting differentiation and growth of
embryonic stem cells in lineages other than hematopoietic
lineages; hormonal or endocrine activity; in the case of
enzymes, correcting deficiencies of the enzyme and treating
deficiency-related diseases; treatment of hyperproliferative
disorders (such as, for example, psoriasis);
immunoglobulin-like activity (such as, for example, the
ability to bind antigens or complement); and the ability to act
as an antigen in a vaccine composition to raise an immune
response against such protein or another material or entity
which is cross-reactive with such protein.

ADMINISTRATION AND DOSING

A protein of the present invention (from whatever source
derived, including without limitation from recombinant and
non-recombinant sources) may be used in a pharmaceutical
composition when combined with a pharmaceutically
acceptable carrier. Such a composition may also contain (in
addition to protein and a carrier) diluents, fillers, salts,
buffers, stabilizers, solubilizers, and other materials well
known in the art. The term "pharmaceutically acceptable"
means a non-toxic material that does not interfere with the
effectiveness of the biological activity of the active
ingredient(s). The characteristics of the carrier will depend
on the route of administration. The pharmaceutical
composition of the invention may also contain cytokines,
lymphokines, or other hematopoietic factors such as M-CSF,
GM-CSF, TNF, IL-1, IL-2, IL-3, IL-4, IL-5, IL-6, IL-7, IL-8,
IL-9, IL-10, IL-11, IL-12, IL-13, IL-14, IL-15, IFN, TNF0,
TNF1, TNF2, G-CSF, Meg-CSF, thrombopoietin, stem cell
factor, and erythropoietin. The pharmaceutical composition
may further contain other agents which either enhance the
activity of the protein or complement its activity or use in
treatment. Such additional factors and/or agents may be
included in the pharmaceutical composition to produce a
synergistic effect with protein of the invention, or to
minimize side effects. Conversely, protein of the present
invention may be included in formulations of the particular
cytokine, lymphokine, other hematopoietic factor,
thrombolytic or anti-thrombotic factor, or anti-inflammatory
agent to minimize side effects of the cytokine, lymphokine,
other hematopoietic factor, thrombolytic or anti-thrombotic
factor, or anti-inflammatory agent.

A protein of the present invention may be active in
multimers (e.g., heterodimers or homodimers) or complexes
with itself or other proteins. As a result, pharmaceutical
compositions of the invention may comprise a protein of the
invention in such multimeric or complexed form.

The pharmaceutical composition of the invention may be
in the form of a complex of the protein(s) of present
invention along with protein or peptide antigens. The protein
and/or peptide antigen will deliver a stimulatory signal to
both B and T lymphocytes. B lymphocytes will respond to
antigen through their surface immunoglobulin receptor. T
lymphocytes will respond to antigen through the T cell
receptor (TCR) following presentation of the antigen by
MHC proteins. MHC and structurally related proteins
including those encoded by class I and class II MHC genes
on host cells will serve to present the peptide antigen(s) to
T lymphocytes. The antigen components could also be
supplied as purified MHC-peptide complexes alone or with
co-stimulatory molecules that can directly signal T cells.
Alternatively antibodies able to bind surface immunoglobulin
and other molecules on B cells as well as antibodies able
to bind the TCR and other molecules on T cells can be
combined with the pharmaceutical composition of the
invention.

The pharmaceutical composition of the invention may be
in the form of a liposome in which protein of the present
invention is combined, in addition to other pharmaceutically
acceptable carriers, with amphipathic agents such as lipids
which exist in aggregated form as micelles, insoluble
monolayers, liquid crystals, or lamellar layers in aqueous
solution. Suitable lipids for liposomal formulation include,
without limitation, monoglycerides, diglycerides, sulfatides,
lysolecithin, phospholipids, saponin, bile acids, and the like.
Preparation of such liposomal formulations is within the
level of skill in the art, as disclosed, for example, in U.S. Pat.
Nos. 4,235,871; 4,501,728; 4,837,028; and 4,737,323, all of
which are incorporated herein by reference.

As used herein, the term "therapeutically effective
amount" means the total amount of each active component
of the pharmaceutical composition or method that is
sufficient to show a meaningful patient benefit, i.e., treatment,
healing, prevention or amelioration of the relevant medical
condition, or an increase in rate of treatment, healing,
prevention or amelioration of such conditions. When applied
to an individual active ingredient administered alone, the
term refers to that ingredient alone. When applied to a
combination, the term refers to combined amounts of the
active ingredients that result in the therapeutic effect,
whether administered in combination, serially or
simultaneously.

In practicing the method of treatment or use of the present
invention, a therapeutically effective amount of protein of
the present invention is administered to a mammal having a
condition to be treated. Protein of the present invention may
be administered in accordance with the method of the
invention either alone or in combination with other therapies
such as treatments employing cytokines, lymphokines or
other hematopoietic factors. When co-administered with one
or more cytokines, lymphokines or other hematopoietic
factors, protein of the present invention may be administered
either simultaneously with the cytokine(s), lymphokine(s),
other hematopoietic factor(s), thrombolytic or
anti-thrombotic factors, or sequentially. If administered
sequentially, the attending physician will decide on the
appropriate sequence of administering protein of the present
invention in combination with cytokine(s), lymphokine(s),
other hematopoietic factor(s), thrombolytic or
anti-thrombotic factors.

Administration of protein of the present invention used in
the pharmaceutical composition or to practice the method of
the present invention can be carried out in a variety of
conventional ways, such as oral ingestion, inhalation, topical
application or cutaneous, subcutaneous, intraperitoneal,
parenteral or intravenous injection. Intravenous
administration to the patient is preferred.

When a therapeutically effective amount of protein of the
present invention is administered orally, protein of the
present invention will be in the form of a tablet, capsule,
powder, solution or elixir. When administered in tablet form,
the pharmaceutical composition of the invention may
additionally contain a solid carrier such as a gelatin or an
adjuvant. The tablet, capsule, and powder contain from
about 5 to 95% protein of the present invention, and
preferably from about 25 to 90% protein of the present
invention. When administered in liquid form, a liquid carrier
such as water, petroleum, oils of animal or plant origin such
as peanut oil, mineral oil, soybean oil, or sesame oil, or
synthetic oils may be added. The liquid form of the
pharmaceutical composition may further contain physiological
saline solution, dextrose or other saccharide solution, or
glycols such as ethylene glycol, propylene glycol or
polyethylene glycol. When administered in liquid form, the
pharmaceutical composition contains from about 0.5 to 90%
by weight of protein of the present invention, and preferably
from about 1 to 50% protein of the present invention.

When a therapeutically effective amount of protein of the
present invention is administered by intravenous, cutaneous
or subcutaneous injection, protein of the present invention
will be in the form of a pyrogen-free, parenterally acceptable
aqueous solution. The preparation of such parenterally
acceptable protein solutions, having due regard to pH,
isotonicity, stability, and the like, is within the skill in the art.
A preferred pharmaceutical composition for intravenous,
cutaneous, or subcutaneous injection should contain, in
addition to protein of the present invention, an isotonic
vehicle such as Sodium Chloride Injection, Ringer's
Injection, Dextrose Injection, Dextrose and Sodium Chloride
Injection, Lactated Ringer's Injection, or other vehicle
as known in the art. The pharmaceutical composition of the
present invention may also contain stabilizers, preservatives,
buffers, antioxidants, or other additives known to those of
skill in the art.

The amount of protein of the present invention in the
pharmaceutical composition of the present invention will
depend upon the nature and severity of the condition being
treated, and on the nature of prior treatments which the
patient has undergone. Ultimately, the attending physician
will decide the amount of protein of the present invention
with which to treat each individual patient. Initially, the
attending physician will administer low doses of protein of
the present invention and observe the patient's response.
Larger doses of protein of the present invention may be
administered until the optimal therapeutic effect is obtained
for the patient, and at that point the dosage is not increased
further. It is contemplated that the various pharmaceutical
compositions used to practice the method of the present
invention should contain about 0.01 μg to about 100 mg
(preferably about 0.1 μg to about 10 mg, more preferably
about 0.1 μg to about 1 mg) of protein of the present
invention per kg body weight.

The duration of intravenous therapy using the
pharmaceutical composition of the present invention will vary,
depending on the severity of the disease being treated and
the condition and potential idiosyncratic response of each
individual patient. It is contemplated that the duration of
each application of the protein of the present invention will
be in the range of 12 to 24 hours of continuous intravenous
administration. Ultimately the attending physician will
decide on the appropriate duration of intravenous therapy
using the pharmaceutical composition of the present
invention.

Protein of the invention may also be used to immunize
animals to obtain polyclonal and monoclonal antibodies
which specifically react with the protein. Such antibodies
may be obtained using either the entire protein or fragments
thereof as an immunogen. The peptide immunogens
additionally may contain a cysteine residue at the carboxyl
terminus, and are conjugated to a hapten such as keyhole
limpet hemocyanin (KLH). Methods for synthesizing such
peptides are known in the art, for example, as in R. P.
Merrifield, J. Amer. Chem. Soc. 85, 2149-2154 (1963); J. L.
Krstenansky, et al., FEBS Lett. 211, 10 (1987). Monoclonal
antibodies binding to the protein of the invention may be
useful diagnostic agents for the immunodetection of the
protein. Neutralizing monoclonal antibodies binding to the
protein may also be useful therapeutics for both conditions
associated with the protein and also in the treatment of some
forms of cancer where abnormal expression of the protein is
involved. In the case of cancerous cells or leukemic cells,
neutralizing monoclonal antibodies against the protein may
be useful in detecting and preventing the metastatic spread
of the cancerous cells, which may be mediated by the
protein.

For compositions of the present invention which are
useful for bone, cartilage, tendon or ligament regeneration,
the therapeutic method includes administering the
composition topically, systemically, or locally as an implant or
device. When administered, the therapeutic composition for
use in this invention is, of course, in a pyrogen-free,
physiologically acceptable form. Further, the composition may
desirably be encapsulated or injected in a viscous form for
delivery to the site of bone, cartilage or tissue damage.
Topical administration may be suitable for wound healing
and tissue repair. Therapeutically useful agents other than a
protein of the invention which may also optionally be 
included in the composition as described above, may alter- 
natively or additionally, be administered simultaneously or 
sequentially with the composition in the methods of the 
invention. Preferably for bone and/or cartilage formation, 
the composition would include a matrix capable of deliver- 
ing the protein-containing composition to the site of bone 
and/or cartilage damage, providing a structure for the devel- 
oping bone and cartilage and optimally capable of being 
resorbed into the body. Such matrices may be formed of 
materials presently in use for other implanted medical 
applications. 

The choice of matrix material is based on
biocompatibility, biodegradability, mechanical properties,
cosmetic appearance and interface properties. The particular
application of the compositions will define the appropriate
formulation. Potential matrices for the compositions may be
biodegradable and chemically defined calcium sulfate,
tricalciumphosphate, hydroxyapatite, polylactic acid,
polyglycolic acid and polyanhydrides. Other potential materials
are biodegradable and biologically well-defined, such as
bone or dermal collagen. Further matrices are comprised of
pure proteins or extracellular matrix components. Other
potential matrices are nonbiodegradable and chemically
defined, such as sintered hydroxyapatite, bioglass,
aluminates, or other ceramics. Matrices may be comprised
of combinations of any of the above mentioned types of
material, such as polylactic acid and hydroxyapatite or
collagen and tricalciumphosphate. The bioceramics may be
altered in composition, such as in
calcium-aluminate-phosphate, and processing to alter pore
size, particle size, particle shape, and biodegradability.

Presently preferred is a 50:50 (mole weight) copolymer of
lactic acid and glycolic acid in the form of porous particles
having diameters ranging from 150 to 800 microns. In some
applications, it will be useful to utilize a sequestering agent,
such as carboxymethyl cellulose or autologous blood clot, to
prevent the protein compositions from disassociating from
the matrix.

A preferred family of sequestering agents is cellulosic
materials such as alkylcelluloses (including
hydroxyalkylcelluloses), including methylcellulose,
ethylcellulose, hydroxyethylcellulose,
hydroxypropylcellulose, hydroxypropyl-methylcellulose,
and carboxymethylcellulose, the most preferred being
cationic salts of carboxymethylcellulose (CMC). Other
preferred sequestering agents include hyaluronic acid, sodium
alginate, poly(ethylene glycol), polyoxyethylene oxide,
carboxyvinyl polymer and poly(vinyl alcohol). The amount of
sequestering agent useful herein is 0.5-20 wt %, preferably
1-10 wt % based on total formulation weight, which
represents the amount necessary to prevent desorption of the
protein from the polymer matrix and to provide appropriate
handling of the composition, yet not so much that the
progenitor cells are prevented from infiltrating the matrix,
thereby providing the protein the opportunity to assist the
osteogenic activity of the progenitor cells.
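
As a worked illustration of the weight-percent ranges just stated, the following minimal sketch converts a total formulation weight into the corresponding mass range of sequestering agent. The function name `sequestering_agent_mass_g` and the 200 g batch size are hypothetical and chosen only for the example; the percentages are those recited above (0.5-20 wt %, preferably 1-10 wt %, based on total formulation weight).

```python
def sequestering_agent_mass_g(total_formulation_g, preferred=False):
    """Return (min, max) grams of sequestering agent for a formulation.

    Percentages are weight percent of the TOTAL formulation weight:
    0.5-20 wt % in general, 1-10 wt % for the preferred range.
    """
    if total_formulation_g <= 0:
        raise ValueError("total formulation weight must be positive")
    low_pct, high_pct = (1.0, 10.0) if preferred else (0.5, 20.0)
    return (total_formulation_g * low_pct / 100.0,
            total_formulation_g * high_pct / 100.0)


# Hypothetical 200 g batch, general and preferred ranges.
general = sequestering_agent_mass_g(200.0)
preferred = sequestering_agent_mass_g(200.0, preferred=True)
print(general, preferred)
```

Because the percentages are based on total formulation weight, the mass of the matrix and protein together is the remainder, not the base of the calculation.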

In further compositions, proteins of the invention may be
combined with other agents beneficial to the treatment of the
bone and/or cartilage defect, wound, or tissue in question.
These agents include various growth factors such as
epidermal growth factor (EGF), platelet derived growth factor
(PDGF), transforming growth factors (TGF-α and TGF-β),
and insulin-like growth factor (IGF).

The therapeutic compositions are also presently valuable
for veterinary applications. Particularly domestic animals
and thoroughbred horses, in addition to humans, are desired
patients for such treatment with proteins of the present
invention.

The dosage regimen of a protein-containing
pharmaceutical composition to be used in tissue regeneration will be
determined by the attending physician considering various
factors which modify the action of the proteins, e.g., amount
of tissue weight desired to be formed, the site of damage, the
condition of the damaged tissue, the size of a wound, type
of damaged tissue (e.g., bone), the patient's age, sex, and
diet, the severity of any infection, time of administration and
other clinical factors. The dosage may vary with the type of
matrix used in the reconstitution and with inclusion of other
proteins in the pharmaceutical composition. For example,
the addition of other known growth factors, such as IGF I
(insulin like growth factor I), to the final composition, may
also affect the dosage. Progress can be monitored by
periodic assessment of tissue/bone growth and/or repair, for
example, X-rays, histomorphometric determinations and
tetracycline labeling.
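
The per-kilogram amounts contemplated earlier in this section (about 0.01 μg to about 100 mg, preferably about 0.1 μg to about 10 mg, more preferably about 0.1 μg to about 1 mg of protein per kg body weight) can be converted to absolute amounts by simple multiplication. The sketch below is illustrative arithmetic only; the names `DOSE_RANGES_NG_PER_KG` and `dose_range_ng`, and the 70 kg body weight, are assumptions made for the example, and amounts are held in nanograms so that the arithmetic stays in whole numbers.

```python
# Per-kg ranges from the text, expressed in nanograms per kg body weight.
# 1 ug = 1_000 ng; 1 mg = 1_000_000 ng.
DOSE_RANGES_NG_PER_KG = {
    "contemplated": (10, 100_000_000),    # about 0.01 ug to about 100 mg
    "preferred": (100, 10_000_000),       # about 0.1 ug to about 10 mg
    "more_preferred": (100, 1_000_000),   # about 0.1 ug to about 1 mg
}


def dose_range_ng(body_weight_kg, tier="contemplated"):
    """Return the (low, high) absolute amount in ng for a body weight."""
    if body_weight_kg <= 0:
        raise ValueError("body weight must be positive")
    low, high = DOSE_RANGES_NG_PER_KG[tier]
    return low * body_weight_kg, high * body_weight_kg


# Hypothetical 70 kg subject, "more preferred" tier.
low_ng, high_ng = dose_range_ng(70, "more_preferred")
print(f"{low_ng / 1_000:g} ug to {high_ng / 1_000_000:g} mg")
```

The selection of an actual amount within any of these ranges remains, as stated above, a matter for the attending physician.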

Polynucleotides of the present invention can also be used
for gene therapy. Such polynucleotides can be introduced
either in vivo or ex vivo into cells for expression in a
mammalian subject. Polynucleotides of the invention may
also be administered by other known methods for
introduction of nucleic acid into a cell or organism (including,
without limitation, in the form of viral vectors or naked
DNA).

Cells may also be cultured ex vivo in the presence of
proteins of the present invention in order to proliferate or to
produce a desired effect on or activity in such cells. Treated
cells can then be introduced in vivo for therapeutic purposes.

Patent and literature references cited herein are incorpo- 
rated by reference as if fully set forth. 



SEQUENCE LISTING 



(1) GENERAL INFORMATION:

(iii) NUMBER OF SEQUENCES: 11

(2) INFORMATION FOR SEQ ID NO:1:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 432 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:1:

GGTTTGAAAA CTCTGCTTCC TTTGTGAATT TGGTGTTAGG AGTTCTTATT GTTATTCTGC 60

AGCCTTTACT ATTGTCCTTT ATTTACTGAA CACAGTGAAT ACCAAGCACT GTTTATTAGA 120

GGTTAGGAGT AGGGGCAGGT GATTAAAAAA ACAAAAAAGC TAATAATCTC CTCAAGCAAT 180

TTCTGGCCTA ATAGAATTAT AGTAGACAGT GAAGTATCTA AACCCAGGGA ATCAGATTGA 240

GGCACCATGT CCATCGCCTT GAGAATTAAT AGGCTGCATT TCTGGGTTCT CCNTTTTTTT 300

TTTTTTTTTG CCCAACTGAG TCTTTCTGTG GACTTACATG GAACTTCTTA TTCTCTTAAA 360

TCATTAAGTT ACTTGACAAT ATTCTTGGAT TTGGAGAAAC TGGATGTAGG GCCGTATGAA 420

AAAATCATTC GA 432


(2) INFORMATION FOR SEQ ID NO:2:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 62 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS:
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: protein

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2:

Met Ser Ile Ala Leu Arg Ile Asn Arg Leu His Phe Trp Val Leu Xaa
1               5                   10                  15

Phe Phe Phe Phe Phe Ala Gln Leu Ser Leu Ser Val Asp Leu His Gly
            20                  25                  30

Thr Ser Tyr Ser Leu Lys Ser Leu Ser Tyr Leu Thr Ile Phe Leu Asp
        35                  40                  45

Leu Glu Lys Leu Asp Val Gly Pro Tyr Glu Lys Ile Ile Arg
    50                  55                  60




(2) INFORMATION FOR SEQ ID NO:3:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 219 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:3:

ATAGGATACN GTATCTNGCT TTTTTCATTT AAACGTCGNG AGCAATTTTC CCAAGACATA 60

ACAAACTGTC TTNGAAAAAN GGAAAACATT NGGGGCTGTC AGCANAACNG AAAATGTTTT 120

CTGGGTGAGA CACATGTATC TTNGNAATGG GTTGGATTTA GTGTGCTTTA TTTCAATAAA 180

AATTCAGTAT TATAATTTAA AAAAAAAAAA AAAAAAAAA 219



(2) INFORMATION FOR SEQ ID NO:4:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 501 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:4:

TCCACAGGTG TCCANTCCCA GGTCCAACTG CAGATTTCGA ATTCGGCCTT CATGGCCTAG 60

AGCGACGCGG AGAARAGCTC CGGGTGCCGC GGCACTGCAG CGCTGAGATT CCTTTACAAA 120

GAAACTCAGA GGACCGGGAA GAAAGAATTT CACCTTTGCG ACGTGCTAGA AAATAARGTC 180

GTCTGGGAAA AGGACTGGAG ACACAAGCGC ATCSCAASYY SRGTGAAGGA SAAASNGAKG 240

GANBTAKWWM MGWGSWGAAA AATKTYWWKC AAMMWMGGTA TTTTCCCTTG GATATTAACT 300

TGCATATCTG AAGAAATGGC ATTCCGGACA ATTTGCGTGT TGGTTGGAGT ATTTATTTGT 360

TCTATCTGTG TGAAAGGATC TTCCCAGCCC CAAGCAAGAG TTTATTTAAC ATTTGATGAA 420

CTTCGAGAAA CCAAGACCTC TGAATACTTC AGCCTTTCCC ACCATCCTTT AGACTACAGG 480

ATTTTATTAA TGGATGAAGA T 501

(2) INFORMATION FOR SEQ ID NO:5:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 62 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS:
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: protein

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:5:

Met Ala Phe Arg Thr Ile Cys Val Leu Val Gly Val Phe Ile Cys Ser
1               5                   10                  15

Ile Cys Val Lys Gly Ser Ser Gln Pro Gln Ala Arg Val Tyr Leu Thr
            20                  25                  30

Phe Asp Glu Leu Arg Glu Thr Lys Thr Ser Glu Tyr Phe Ser Leu Ser
        35                  40                  45

His His Pro Leu Asp Tyr Arg Ile Leu Leu Met Asp Glu Asp
    50                  55                  60

(2) INFORMATION FOR SEQ ID NO:6:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 302 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:6:

CTAGCACTAG ACATGTCATG GTCTTCATGG TGCATATAAA TATATTTAAC TTAACCCAGA 60

TTTTATTTAT ATCTTTATTC ACCTTTTCTT CAAAATCGAT ATGGTGGCTG CAAAACTAGA 120

ATTGTTGCAT CCCTCAATNG AATGAGGGCC ATATCCCTGT GGTATTCCTT TCCTGCTTNG 180

GGGCTTTAGA ATTCTAATTG TCAGTGATTT TGTATATGAA AACAAGTTCC AAATCCACAG 240

CTTTTACGTA GTAAAAGTCA TAAATGCATA TGACAGAATG GCTATCAAAA GAAAAAAAAA 300

AA 302

(2) INFORMATION FOR SEQ ID NO:7:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 448 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: double
(D) TOPOLOGY: linear

(ii) MOLECULE TYPE: cDNA

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7:

GGCGAARGCA GCGGCAGGTC GGGAGCAARA TGGCGCTGCG GCCAGGAGCT GGTTCTGGTG 60

GCGGCGGGGC CGCGARGAKY ATRRYGYGRK KTYYRYYSKG KKWKSMGGST TCATGTTTCC 120

TGTTGCAGGT GGGATAAGAC CCCCTCAAGG CCTGATGCCG ATGCAGCAAC AAGGATTTCC 180

TATGGTCTCT GTCATGCAGC CTAATATGCA AGGCATTATG GGAATGAATT ACAGCTCTCA 240

GATGTCCCAA GGACCTATTG CTATGCAGGC AGGAATACCA ATGGGACCAA TGCCAGCAGC 300

GGGAATGCCT TACCTAGGAC AAGCACCCTT CCTGGGCATG CGTCCTCCAG GCCCACAGTA 360

CACTCCAGAC ATGCAGAAGC AGTTTGCCGA AGAGCAGCAG AAACGATTTG AACAGCAGCA 420

AAAACTCTTA GAAAAAAAAA AAAAAAAA 448

( 2 ) INFORMATION FOR SEQ ID NO:S: 

( i ) SEQUENCE CHARACTERISTICS: 

( A ) LENGTH: 107 amino adds 
( B ) TYPE: ammo add 
( C ) STRANDEDNESS: 
( D ) TOPOLOGY: fiaear 

( i i ) MOLECULE TYPE: potan 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NOA 

Met Phc Pro Val Ala Oly Gly lie Arg Pro Pro Ola Gly Leo Met Pro 
I 5 10 15 

Met Gin Gin Gin Gly Phe Pro Met Val Scr Val Met Gin Pro A*n Met 

2 0 2 3 3 0 

Gin Gly lie Met Oly Met Aid Tyr Sei Ser Olo Met Ser Gin Gly Pro 
3 5 4 0 4 5 

lie Ala Met Gin Ala Oly lie Pro Met Gly Pro Met Pro Ala Ala Gly 
5 0 5 5 6 0 

Met Pro Tyr Leo Oly Gin Ala Pro Phe Leu Oly Met Arg Pro Pro Oly 
65 70 75 80 

Pro Gin Tyr Tbr Pro Alp Met Gin Ly» Gin Phe Ala Glu Glu Gin Gin 

8 5 90 9 5 

Lys Arg Phe Glu Gin Gin Oln Ly« Leu Leu O 1 u 

10 0 10 5 

( 2 ) INFORMATION FOR SEQ ID NO:9: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 29 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: other nucleic acid 

( A ) DESCRIPTION: /desc = "oligonucleotide" 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:9: 

ONOCCTCAAT CTOATTCCCT OGOTTT AO A 29 



( 2 ) INFORMATION FOR SEQ ID NO:10: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 29 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: other nucleic acid 

( A ) DESCRIPTION: /desc = "oligonucleotide" 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:10: 
ONCCOOAATO CCATTTCTTC AO AT ATOC A 



( 2 ) INFORMATION FOR SEQ ID NO:11: 

( i ) SEQUENCE CHARACTERISTICS: 
( A ) LENGTH: 29 base pairs 
( B ) TYPE: nucleic acid 
( C ) STRANDEDNESS: single 
( D ) TOPOLOGY: linear 

( i i ) MOLECULE TYPE: other nucleic acid 

( A ) DESCRIPTION: /desc = "oligonucleotide" 

( x i ) SEQUENCE DESCRIPTION: SEQ ID NO:11: 
TNCCAT TOOT ATTCCTOCCT OCATAOCAA 






What is claimed is: 

1. An isolated polynucleotide selected from the group 
consisting of: 

(a) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1; 

(b) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1 from nucleotide 247 to nucleotide 
432; 

(c) a polynucleotide comprising the nucleotide sequence 
of SEQ ID NO:1 from nucleotide 328 to nucleotide 
432; 

(d) a polynucleotide comprising the nucleotide sequence 
of the full length protein coding sequence of clone 
BD372_5 deposited under accession number ATCC 
98146; 

(e) a polynucleotide encoding the full length protein 
encoded by the cDNA insert of clone BD372_5 
deposited under accession number ATCC 98146; 

(f) a polynucleotide comprising the nucleotide sequence 
of the mature protein coding sequence of clone 
BD372_5 deposited under accession number ATCC 
98146; 

(g) a polynucleotide encoding the mature protein encoded 
by the cDNA insert of clone BD372_5 deposited under 
accession number ATCC 98146; and 

(h) a polynucleotide encoding a protein comprising the 
amino acid sequence of SEQ ID NO:2. 

2. The polynucleotide of claim 1 comprising the 
nucleotide sequence of SEQ ID NO:1. 

3. The polynucleotide of claim 1 comprising the 
nucleotide sequence of SEQ ID NO:1 from nucleotide 247 to 
nucleotide 432. 






4. The polynucleotide of claim 1 comprising the 
nucleotide sequence of SEQ ID NO:1 from nucleotide 328 to 
nucleotide 432. 

5. The polynucleotide of claim 1 comprising the 
nucleotide sequence of the full length protein coding sequence of 
clone BD372_5 deposited under accession number ATCC 
98146. 

6. The polynucleotide of claim 1 encoding the full length 
protein encoded by the cDNA insert of clone BD372_5 
deposited under accession number ATCC 98146. 

7. The polynucleotide of claim 1 comprising the 
nucleotide sequence of the mature protein coding sequence of 
clone BD372_5 deposited under accession number ATCC 
98146. 

8. The polynucleotide of claim 1 encoding the mature 
protein encoded by the cDNA insert of clone BD372_5 
deposited under accession number ATCC 98146. 

9. The polynucleotide of claim 1 encoding a protein 
comprising the amino acid sequence of SEQ ID NO:2. 

10. A vector comprising a polynucleotide of claim 1 
wherein said polynucleotide is operably linked to an 
expression control sequence. 

11. A host cell transformed with a vector of claim 2. 

12. The host cell of claim 3, wherein said cell is a 
mammalian cell. 

13. A process for producing a protein, which comprises: 

(a) growing a culture of the host cell of claim 3 in a 
suitable culture medium; and 

(b) purifying the protein from the culture. 

14. An isolated gene corresponding to the cDNA sequence 
of SEQ ID NO:1. 
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ABSTRACT 



The present invention relates to purified DNA sequences 
encoding all or a portion of an osteoclast-specific or -related 
gene product and a method for identifying such sequences. 
The invention also relates to antibodies directed against an 
osteoclast-specific or -related gene product. Also claimed 
are DNA constructs capable of replicating DNA encoding all 
or a portion of an osteoclast-specific or -related gene 
product, and DNA constructs capable of directing expression in 
a host cell of an osteoclast-specific or -related gene product. 
HUMAN OSTEOCLAST-SPECIFIC AND -RELATED GENES 

RELATED APPLICATION 

This application is a continuation of application Ser. No. 
08/045,270 filed on Apr. 6, 1993, now abandoned. 

BACKGROUND OF THE INVENTION 

Excessive bone resorption by osteoclasts contributes to 
the pathology of many human diseases including arthritis, 
osteoporosis, periodontitis, and hypercalcemia of malignancy. 
During resorption, osteoclasts remove both the mineral 
and organic components of bone (Blair, H. C., et al., J. 
Cell Biol. 102:1164 (1986)). The mineral phase is solubilized 
by acidification of the sub-osteoclastic lacuna, thus 
allowing dissolution of hydroxyapatite (Vaes, G., Clin. 
Orthop. Relat. 231:239 (1988)). However, the mechanism(s) 
by which type I collagen, the major structural protein of 
bone, is degraded remains controversial. In addition, the 
regulation of osteoclastic activity is only partly understood. 
The lack of information concerning osteoclast function is 
due in part to the fact that these cells are extremely difficult to 
isolate as pure populations in large numbers. Furthermore, 
there are no osteoclastic cell lines available. An approach to 
studying osteoclast function that permits the identification of 
heretofore unknown osteoclast-specific or -related genes and 
gene products would allow identification of genes and gene 
products that are involved in the resorption of bone and in 
the regulation of osteoclastic activity. Therefore, identification 
of osteoclast-specific or -related genes or gene products 
would prove useful in developing therapeutic strategies for 
the treatment of disorders involving aberrant bone resorption. 

SUMMARY OF THE INVENTION 

The present invention relates to isolated DNA sequences 
encoding all or a portion of osteoclast-specific or -related 
gene products. The present invention further relates to DNA 
constructs capable of replicating DNA encoding osteoclast-specific 
or -related gene products. In another embodiment, 
the invention relates to a DNA construct capable of directing 
expression of all or a portion of the osteoclast-specific or 
-related gene product in a host cell. 

Also encompassed by the present invention are prokaryotic 
or eukaryotic cells transformed or transfected with a 
DNA construct encoding all or a portion of an osteoclast-specific 
or -related gene product. According to a particular 
embodiment, these cells are capable of replicating the DNA 
construct comprising the DNA encoding the osteoclast-specific 
or -related gene product, and, optionally, are capable 
of expressing the osteoclast-specific or -related gene product. 
Also claimed are antibodies raised against osteoclast-specific 
or -related gene products, or portions of these gene 
products. 

The present invention further embraces a method of 
identifying osteoclast-specific or -related DNA sequences 
and DNA sequences identified in this manner. In one 
embodiment, cDNA encoding osteoclast is identified as 
follows: First, human giant cell tumor of the bone was used 
to 1) construct a cDNA library; 2) produce ³²P-labelled 
cDNA to use as a stromal cell⁺, osteoclast⁺ probe, and 3) 
produce (by culturing) a stromal cell population lacking 
osteoclasts. The presence of osteoclasts in the giant cell 
tumor was confirmed by histological staining for the osteoclast 
marker, type 5 tartrate-resistant acid phosphatase 
(TRAP) and with the use of monoclonal antibody reagents. 

The stromal cell population lacking osteoclasts was produced 
by dissociating cells of a giant cell tumor, then 
growing and passaging the cells in tissue culture until the 
cell population was homogeneous and appeared fibroblastic. 
The cultured stromal cell population did not contain osteoclasts. 
The cultured stromal cells were then used to produce 
a stromal cell⁺, osteoclast⁻ ³²P-labelled cDNA probe. 

The cDNA library produced from the giant cell tumor of 
the bone was then screened in duplicate for hybridization to 
the cDNA probes: one screen was performed with the giant 
cell tumor cDNA probe (stromal cell⁺, osteoclast⁺), while a 
duplicate screen was performed using the cultured stromal 
cell cDNA probe (stromal cell⁺, osteoclast⁻). Hybridization 
to a stromal⁺, osteoclast⁺ probe, accompanied by failure to 
hybridize to a stromal⁺, osteoclast⁻ probe, indicated that a 
clone contained nucleic acid sequences specifically 
expressed by osteoclasts. 

In another embodiment, genomic DNA encoding osteoclast-specific 
or -related gene products is identified through 
known hybridization techniques or amplification techniques. 
In one embodiment, the present invention relates to a 
method of identifying DNA encoding an osteoclast-specific 
or -related protein, or gene product, by screening a cDNA 
library or a genomic DNA library with a DNA probe 
comprising one or more sequences selected from the group 
consisting of the DNA sequences set out in Table I (SEQ ID 
NOs: 1-32). Finally, the present invention relates to an 
osteoclast-specific or -related protein encoded by a nucleotide 
sequence comprising a DNA sequence selected from 
the group consisting of the sequences set out in Table I, or 
their complementary strands. 



BRIEF DESCRIPTION OF FIG. 1 

FIG. 1 shows the cDNA sequence (SEQ ID NO: 33) of 
human gelatinase B, and highlights those portions of the 
sequence represented by the osteoclast-specific or -related 
cDNA clones of the present invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

As described herein, Applicant has identified osteoclast-specific 
or osteoclast-related nucleic acid sequences. These 
sequences were identified as follows: Human giant cell 
tumor of the bone was used to 1) construct a cDNA library; 
2) produce ³²P-labelled cDNA to use as a stromal cell⁺, 
osteoclast⁺ probe, and 3) produce (by culturing) a stromal 
cell population lacking osteoclasts. The presence of osteoclasts 
in the giant cell tumor was confirmed by histological 
staining for the osteoclast marker, type 5 acid phosphatase 
(TRAP). In addition, monoclonal antibody reagents were 
used to characterize the multinucleated cells in the giant cell 
tumor, which cells were found to have a phenotype distinct 
from macrophages and consistent with osteoclasts. 

The stromal cell population lacking osteoclasts was produced 
by dissociating cells of a giant cell tumor, then 
growing the cells in tissue culture for at least five passages. 
After five passages the cultured cell population was homogeneous 
and appeared fibroblastic. The cultured population 
contained no multinucleated cells at this point, tested negative 
for type 5 acid phosphatase, and tested variably alkaline 
phosphatase positive. That is, the cultured stromal cell 
population did not contain osteoclasts. The cultured stromal 
cells were then used to produce a stromal cell⁺, osteoclast⁻ 
³²P-labelled cDNA probe. 

The cDNA library produced from the giant cell tumor of 
the bone was then screened in duplicate for hybridization to 
the cDNA probes: one screen was performed with the giant 
cell tumor cDNA probe (stromal cell⁺, osteoclast⁺), while a 
duplicate screen was performed using the cultured stromal 
cell cDNA probe (stromal cell⁺, osteoclast⁻). Clones that 
hybridized to the giant cell tumor cDNA probe (stromal⁺, 
osteoclast⁺), but not to the stromal cell cDNA probe (stromal⁺, 
osteoclast⁻), were assumed to contain nucleic acid 
sequences specifically expressed by osteoclasts. 
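
The selection rule of the duplicate screen can be sketched as a set difference. This is an illustration only: the screen itself was carried out by filter hybridization, and the clone identifiers and the function name select_candidate_clones below are hypothetical.

```python
def select_candidate_clones(tumor_positive, stromal_positive):
    """Return clones that hybridized to the giant cell tumor probe
    (stromal+, osteoclast+) but not to the cultured stromal cell probe
    (stromal+, osteoclast-)."""
    return sorted(set(tumor_positive) - set(stromal_positive))


# Hypothetical clone identifiers, for illustration only.
tumor_hits = ["c001", "c002", "c003", "c004"]
stromal_hits = ["c002", "c004", "c005"]

candidates = select_candidate_clones(tumor_hits, stromal_hits)
print(candidates)  # clones retained as putatively osteoclast-expressed
```

A clone positive with both probes is discarded because its signal can be explained by stromal cell expression alone.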

As a result of the differential screen described herein, 
DNA specifically expressed in osteoclast cells characterized 
as described herein was identified. This DNA, and equivalent 
DNA sequences, is referred to herein as osteoclast-specific 
or osteoclast-related DNA. Osteoclast-specific or 
-related DNA of the present invention can be obtained from 
sources in which it occurs in nature, can be produced 
recombinantly or synthesized chemically; it can be cDNA, 
genomic DNA, recombinantly-produced DNA or chemically-produced 
DNA. An equivalent DNA sequence is one 
which hybridizes, under standard hybridization conditions, 
to an osteoclast-specific or -related DNA identified as 
described herein or to a complement thereof. 

Differential screening of a human osteoclastoma cDNA 
library was performed to identify genes specifically 
expressed in osteoclasts. Of 12,000 clones screened, 195 
clones were identified which are either uniquely expressed 
in osteoclasts, or are osteoclast-related. These clones were 
further identified as osteoclast-specific, as evidenced by 
failure to hybridize to mRNA derived from a variety of 
unrelated human cell types, including epithelium, fibroblasts, 
lymphocytes, myelomonocytic cells, osteoblasts, and 
neuroblastoma cells. Of these, 32 clones contain novel 
cDNA sequences which were not found in the GenBank 
database. 

A large number of cDNA clones obtained by this procedure 
were found to represent 92 kDa type IV collagenase 
(gelatinase B; E.C. 3.4.24.35) as well as tartrate resistant 
acid phosphatase. In situ hybridization localized mRNA for 
gelatinase B to multinucleated giant cells in human osteoclastomas. 
Gelatinase B immunoreactivity was demonstrated 
in giant cells from 8/8 osteoclastomas, osteoclasts in 
normal bone, and in osteoclasts of Paget's disease by use of 
a polyclonal antiserum raised against a synthetic gelatinase B 
peptide. In contrast, no immunoreactivity for 72 kDa type IV 
collagenase (gelatinase A; E.C. 3.4.24.24), which is the 
product of a separate gene, was detected in osteoclastomas 
or normal osteoclasts. 

The present invention has utility for the production and 
identification of nucleic acid probes useful for identifying 
osteoclast-specific or -related DNA. Osteoclast-specific or 
-related DNA of the present invention can be used to 
produce osteoclast-specific or -related gene products useful 
in the therapeutic treatment of disorders involving aberrant 
bone resorption. The osteoclast-specific or -related 
sequences are also useful for generating peptides which can 
then be used to produce antibodies useful for identifying 
osteoclast-specific or -related gene products, or for altering 
the activity of osteoclast-specific or -related gene products. 
Such antibodies are referred to as osteoclast-specific antibodies. 
Osteoclast-specific antibodies are also useful for 
identifying osteoclasts. Finally, osteoclast-specific or -related 
DNA sequences of the present invention are useful in 
gene therapy. For example, they can be used to alter the 
expression in osteoclasts of an aberrant osteoclast-specific 
or -related gene product or to correct aberrant expression of 
an osteoclast-specific or -related gene product. The 
sequences described herein can further be used to cause 
osteoclast-specific or -related gene expression in cells in 
which such expression does not ordinarily occur, i.e., in cells 
which are not osteoclasts. 

Example 1 — Osteoclast cDNA Library Construction 

Messenger RNA (mRNA) obtained from a human osteoclastoma 
("giant cell tumor of bone") was used to construct 
an osteoclastoma cDNA library. Osteoclastomas are actively 
bone resorptive tumors, but are usually non-metastatic. In 
cryostat sections, osteoclastomas consist of ~30% multinucleated 
cells positive for tartrate resistant acid phosphatase 
(TRAP), a widely utilized phenotypic marker specific 
in vivo for osteoclasts (Minkin, Calcif. Tissue Int. 
34:285-290 (1982)). The remaining cells are uncharacterized 
'stromal' cells, a mixture of cell types with fibroblastic/ 
mesenchymal morphology. Although it has not yet been 
definitively shown, it is generally held that the osteoclasts in 
these tumors are non-transformed, and are activated to 
resorb bone in vivo by substance(s) produced by the stromal 
cell element. 

Monoclonal antibody reagents were used to partially 
characterize the surface phenotype of the multinucleated 
cells in the giant cell tumors of long bone. In frozen sections, 
all multinucleated cells expressed CD68, which has previously 
been reported to define an antigen specific for both 
osteoclasts and macrophages (Horton, M. A. and M. H. 
Helfrich, In Biology and Physiology of the Osteoclast, B. R. 
Rifkin and C. V. Gay, editors, CRC Press, Inc. Boca Raton, 
Fla., 33-54 (1992)). In contrast, no staining of giant cells 
was observed for CD11b or CD14 surface antigens, which 
are present on monocyte/macrophages and granulocytes 
(Arnaout, M. A. et al. J. Cell Physiol. 137:305 (1988); 
Haziot, A. et al. J. Immunol. 141:547 (1988)). Cytocentrifuge 
preparations of human peripheral blood monocytes 
were positive for CD68, CD11b, and CD14. These results 
demonstrate that the multinucleated giant cells of osteoclastomas 
have a phenotype which is distinct from that of 
macrophages, and which is consistent with that of osteoclasts. 

Osteoclastoma tissue was snap frozen in liquid nitrogen 
and used to prepare poly A⁺ mRNA according to standard 
methods. cDNA cloning into a pcDNAII vector was carried 
out using a commercially-available kit (Librarian, InVitrogen). 
Approximately 2.6×10⁶ clones were obtained, >95% 
of which contained inserts of an average length 0.6 kB. 

Example 2 — Stromal Cell mRNA Preparation 

A portion of each osteoclastoma was snap frozen in liquid 
nitrogen for mRNA preparation. The remainder of the tumor 
was dissociated using brief trypsinization and mechanical 
disaggregation, and placed into tissue culture. These cells 
were expanded in Dulbecco's MEM (high glucose, Sigma) 
supplemented with 10% newborn calf serum (MA Bioproducts), 
gentamycin (Oi mg/ml), L-glutamine (2 mM) and 
non-essential amino acids (0.1 mM) (Gibco). The stromal 
cell population was passaged at least five times, after which 
it showed a homogenous, fibroblastic looking cell population 
that contained no multinucleated cells. The stromal cells 
were mononuclear, tested negative for acid phosphatase, and 
tested variably alkaline phosphatase positive. These findings 
indicate that propagated stromal cells (i.e., stromal cells that 
are passaged in culture) are non-osteoclastic and non-activated. 

Example 3 — Identification of DNA Encoding 
Osteoclastoma-Specific or -Related Gene Products 
by Differential Screening of an Osteoclastoma 
cDNA Library 

A total of 12,000 clones drawn from the osteoclastoma 
cDNA library were screened by differential hybridization, 
using mixed ³²P-labelled cDNA probes derived from (1) 
giant cell tumor mRNA (stromal cell⁺, OC⁺), and (2) mRNA 
from stromal cells (stromal cell⁺, OC⁻) cultivated from the 
same tumor. The probes were labelled with [³²P]dCTP by 
random priming to an activity of ~10? CPM/µg. Of these 
12,000 clones, 195 gave a positive hybridization signal with 
giant cell (i.e., osteoclast and stromal cell) mRNA, but not 
with stromal cell mRNA. Additionally, these clones failed to 
hybridize to cDNA produced from mRNA derived from a 
variety of unrelated human cell types including epithelial 
cells, fibroblasts, lymphocytes, myelomonocytic cells, 
osteoblasts, and neuroblastoma cells. The failure of these 
clones to hybridize to cDNA produced from mRNA derived 
from other cell types supports the conclusion that these 
clones are either uniquely expressed in osteoclasts, or are 
osteoclast-related. 

The osteoclast (OC) cDNA library was screened for 
differential hybridization to OC cDNA (stromal cell⁺, OC⁺) 
and stromal cell cDNA (stromal cell⁺, OC⁻) as follows: 

NYTRAN filters (Schleicher & Schuell) were placed on 
agar plates containing growth medium and ampicillin. Individual 
bacterial colonies from the OC library were randomly 
picked and transferred, in triplicate, onto filters with preruled 
grids and then onto a master agar plate. Up to 200 
colonies were inoculated onto a single 90-mm filter/plate 
using these techniques. The plates were inverted and incubated 
at 37° C. until the bacterial inoculates had grown (on 
the filter) to a diameter of 0.5-1.0 mm. 

The colonies were then lysed, and the DNA bound to the 
filters by first placing the filters on top of two pieces of 
Whatman 3 MM paper saturated with 0.5N NaOH for 5 
minutes. The filters were neutralized by placing on two 
pieces of Whatman 3 MM paper saturated with 1M Tris-HCl, 
pH 8.0 for 3-5 minutes. Neutralization was followed 
by incubation on another set of Whatman 3 MM papers 
saturated with 1M Tris-HCl, pH 8.0/1.5M NaCl for 3-5 
minutes. The filters were then washed briefly in 2×SSC. 

DNA was immobilized on the filters by baking the filters 
at 80° C. for 30 minutes. Filters were best used immediately, 
but they could be stored for up to one week in a vacuum jar 
at room temperature. 

Filters were prehybridized in 5-8 ml of hybridization 
solution per filter, for 2-4 hours in a heat sealable bag. An 
additional 2 ml of solution was added for each additional 
filter added to the hybridization bag. The hybridization 
buffer consisted of 5×SSC, 5×Denhardt's solution, 1% SDS 
and 100 µg/ml denatured heterologous DNA. 
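
The volume rule stated above can be worked through as a short calculation. This sketch assumes one reading of the text: 5-8 ml for the first filter in a bag, plus 2 ml for each further filter in the same bag. The function name and its default values are hypothetical.

```python
def prehybridization_volume_ml(n_filters, first_filter_ml=5.0, extra_per_filter_ml=2.0):
    """Volume of hybridization solution for one bag.

    Assumed reading of the text: the first filter takes 5-8 ml
    (first_filter_ml), and each additional filter in the same bag
    adds 2 ml (extra_per_filter_ml).
    """
    if n_filters < 1:
        raise ValueError("at least one filter is required")
    return first_filter_ml + extra_per_filter_ml * (n_filters - 1)


# Range of volumes for a bag holding four filters.
low = prehybridization_volume_ml(4, first_filter_ml=5.0)
high = prehybridization_volume_ml(4, first_filter_ml=8.0)
print(low, high)
```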

Prior to hybridization, labeled probe was denatured by 
heating in 1×SSC for 5 minutes at 100° C., then immediately 
chilled on ice. Denatured probe was added to the filters in 
hybridization solution, and the filters hybridized with continuous 
agitation for 12-20 hours at 65° C. 

After hybridization, the filters were washed in 2×SSC/ 
0.2% SDS at 50°-60° C. for 30 minutes, followed by 
washing in 0.2×SSC/0.2% SDS at 60° C. for 60 minutes. 

The filters were then air dried and autoradiographed using 
an intensifying screen at -70° C. overnight. 

Example 4 — DNA Sequencing of Selected Clones 

Clones reactive with the mixed tumor probe, but unreactive 
with the stromal cell probe, are expected to contain 
either osteoclast-related, or in vivo 'activated' stromal-cell-related 
gene products. One hundred and forty-four cDNA 
clones that hybridized to tumor cell cDNA, but not to 
stromal cell cDNA, were sequenced by the dideoxy chain 
termination method of Sanger et al. (Sanger F., et al. Proc. 
Natl. Acad. Sci. USA 74:5463 (1977)) using Sequenase (US 
Biochemical). The DNASIS (Hitachi) program was used to 
carry out sequence analysis and a homology search in the 
GenBank/EMBL database. 

Fourteen of the 195 tumor⁺ stromal⁻ clones were identified 
as containing inserts with a sequence identical to the 
osteoclast marker, type 5 tartrate-resistant acid phosphatase 
(TRAP) (GenBank accession number J04430 M19534). The 
high representation of TRAP positive clones also indicates 
the effectiveness of the screening procedure in enriching for 
clones which contain osteoclast-specific or -related cDNA 
sequences. 

Interestingly, an even larger proportion of the tumor⁺ 
stromal⁻ clones (77/195; 39.5%) were identified as human 
gelatinase B (macrophage-derived gelatinase) (Wilhelm, S. 
M. J. Biol. Chem. 264:17213 (1989)), again indicating high 
expression of this enzyme by osteoclasts. Twenty-five of the 
gelatinase B clones were identified by dideoxy sequence 
analysis; all 25 showed 100% sequence homology to the 
published gelatinase B sequence (GenBank accession number 
J05070). The portions of the gelatinase B cDNA 
sequence covered by these clones are shown in the FIGURE 
(SEQ ID NO: 33). An additional 52 gelatinase B clones were 
identified by reactivity with a ³²P-labelled probe for gelatinase 
B. 
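
The clone counts reported above can be checked arithmetically. The following is a minimal sketch using only figures stated in the text; the variable names are illustrative.

```python
# Figures taken from the text above.
TOTAL_DIFFERENTIAL_CLONES = 195   # tumor+ stromal- clones
GELATINASE_B_BY_SEQUENCING = 25   # identified by dideoxy sequence analysis
GELATINASE_B_BY_PROBE = 52        # identified with a labelled gelatinase B probe
TRAP_CLONES = 14                  # inserts identical to TRAP

gelatinase_b_total = GELATINASE_B_BY_SEQUENCING + GELATINASE_B_BY_PROBE
gelatinase_b_percent = round(100 * gelatinase_b_total / TOTAL_DIFFERENTIAL_CLONES, 1)
trap_percent = round(100 * TRAP_CLONES / TOTAL_DIFFERENTIAL_CLONES, 1)

print(gelatinase_b_total, gelatinase_b_percent)  # 77 39.5
print(trap_percent)
```

The two identification routes together account for the 77 gelatinase B clones, which matches the stated proportion of 39.5%.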

Thirteen of the sequenced clones yielded no readable 
sequence. A DNASIS search of GenBank/EMBL databases 
revealed that, of the remaining 91 clones, 32 clones contain 
novel sequences which have not yet been reported in the 
databases or in the literature. These partial sequences are 
presented in Table I. Note that three of these sequences were 
repeats, indicating fairly frequent representation of mRNA 
related to this sequence. The repeat sequences are indicated 
by a and b superscripts (Clones 198B, 223B and 32C of Table I). 



TABLE I 



PARTIAL SEQUENCES OF 32 NOVEL OC-SPECIFIC OR -RELATED 
EXPRESSED GENES (cDNA CLONES) 



MA (SEQ H> NO: I) 
I G CAAATA TCT 
61 AATCTTTCTA 
121 GTGATATTCT 
4B (SEQ ID NO: 2) 
1 CTCTCAACCT 



AAGTTTATTG 
GGGTTTTTTT 
CTTTGAATAA 

CCATATCCTA 



CTTGCATTTC 
AGTTTGTTTT 
ACCTATAATA 

AAAATGTCAA 



TAOTOAGAGC 
TATTGAAAAA 
GAAAATAGCA 

AATOCTGCAT 



TGTTGAATTT 

TTTAATTATT 

GCACACAACA 

CTOOTTAATG 



GGTGATGTCA 
TATGCTATAG 



TCGGGGTAGG 
TABLE I-continued 



PARTIAL SEQUENCES OF 32 NOVEL OC-SPECIFIC OR -RELATED 
EXPRESSED GENES (cDNA CLONES) 



61 CGO 
12B(SSQP NO:3) 
I CTTCCCTCTC 
61 CAGGCCCACA 
121 CAACCACCTO 
28B (SEQ ID NO: 4) 
1 TTTTATTTGT 
61 GTGTGTTTTC 
121 AAAOCAAACT 
37B (SEQ ID NO: 5} 
1 GGCTGGACAT 
61 TTGCCCTGGC 
121 AGCCACTTTG 
181 ACAAAAAAAA 
55B (SEQ ID NO: 6) 
1 TTGACAAAGC 
61 AAGAGTAGTG 
121 TAATTTGCCT 
60B(SEQIDNO: 7) 
I GAAGAGAGTT 
61 GATCCCGAGG 
86B (SEQ ID NO: 8) 
1 GGATGGAAAC 
61 GCAAACCTGA 
121 TGGTTGCTCT 
87B (SEQ ID NO: 9) 
1 TTCTTGATCT 
61 TAGGAGCCGT 
181 CAATGATAAA 
98B (SEQ ID NO: 10) 
1 ACCCATTTCT 
61 CTCAAAGAAT 
121 GAATATGAGG 
HOB (SEQ ID NO. 11) 
1 ACATATATTA 
61 TAAAGTGGGA 
121 TAACrmTT 
118B(SEQ ©NO 12) 
I CCAAATTTCT 
61 TTTGACTACT 
133B(SEQIDNO: 13) 
I AACTAACCTC 
61 CCTCAGCCAT 
121 AAAT 
140B (SEQ ID NO: 14) 
1 ATTATTATTC 
61 AAAACACACA 
121 GATAAACCCG 
1448 (SEQ IDNO: 15) 
1 CGTGACACAA 
61 AACAGCATGT 
198B a (SEQmNO. 16) 
1 ATAGGTTAGA 
61 ATCTGACTTC 
121 TCTACTCCAA 
181 ATGTGATTTG 
241 TTTAAT 
212B(SEQIDNO: 17) 
1 GTCCAGTATA 
61 CCTCTAGATA 
121 AATGCCCTTC 
181 TCTGGAGC 
223B»(SEQIDNO: 18) 
1 GCACTT GGAA 
61 TGTTCAGTTT 
121 CCATGACCTT 
18] TAAGAGATGT 
24JB (SEQ IDNO: 19) 
1 TGTTAGTTTT 
61 CTAGACGTCC 
121 GGAAGGGCTC 
181 CTATATGACC 
32C* (SEQ ID NO. 20) 
1 CCTATTTCTG 
121 TCCGTCTAOC 
161 GGGTGGAAGG 



TTGCTTCCCT 

GGGAGTACTG 

GTGGTGAATG 

AAATATATGT 
GTCTTGCTTC 
GGOGGOATGO 

GGGTGCCCTC 
CATGTCATCT 
TTACGCGACG 
AAAAAAA 

TGTTTATTTC 
GCTATTATAT 
TC 

GTATGTACAA 
GAATT 

ATGTAGAAGT 
GATTTCAGCA 
TGCACGTATC 

T TAGAA CACT 
GCTTTTGGAA 
ACTTGACAAA 



AACAATTTTT 

AGAGGCAATA 

ACAAGCTCTA 

ACAGCATTCA 
ATGTATCAAG 
TTTTTACATT 

CTGGAATCCA 
CCAGC 

CTCGGACCCC 
GGCCATCGCT 



TTTTTTTATG 

TCCCAJTGAA 

GCACGTCCTG 

ACATGCATTC 
TCATCAGCAG 

TTCTCATTCA 
TCACTTCCTA 
TTCATAAATC 
TCTTCCCTTC 



AAGGAAAGCG 
AAACACCCGA 
TACACATTAG 



GGGAGTTCGT 
CCCCATTTGT 
TTTCACTGTG 
GACTACAGCC 

TACGAAGGCC 
TATAGTTAGT 
TTTGCTAGTA 
ATAGTAAGGC 

ATCCTGACTT 

AGAGCGTGCA 

GGCAGGATTC 



TTCCCAAGCA 
CCAGACTACT 
CTCCCTGGCA 

ATTACATCCC 
TTCATGGTCC 
AAGCAGATTA 

CACGTCCCTC 
ACCTGGAGTG 
ATTTCCCAGA 



CACCAATAAA 
GGGCTATCAT 



CCCCAACAGC 



CCAGA GAAAA 

TAAAATCTTT 

AATAGGTTAT 

ATGAATAGGG 
TGCTTGAGTG 
A 

ACTGTAAAAT 
TATAGCCCAT 
GTCGTCATTA 

TTTGGCCAAA 

TATAGACTAT 

ATAAAATTAA 

Tccrcccrcc 



TGCCrCACTC 
TATGAGCGGC 



TTAGCTTAGC 
GGGTTTTGTA 
ATAGGAAATT 

GTTTTATTCA 
GAAGCTGGCC 

CGGGACTAGT 
AGTTCCCTCT 
TATTCA TAAG 
TTTGCACTTT 



TTAAGTCGGT 
TTAACAGATG 
CTCCAGCTAA 



GTGCTATTTT 
TTGTGCTTCA 
GCCATCAAGG 
TGCCCCTGAC 

TGTCTTCTGG 
CACTGGGGAT 
TCTCCATTTC 
TGT 

TGGACAAGGC 

CTTGTGATCC 

TGCAGCTGCT 



GAGGTGCTCA 

GCTGATGTTC 

CGGGACCCCC 

TAGAAAAAGA 

ATGATGCCAG 

TTCTGCCATT 

ATATCCCCAG 

GGCCCTOCCCC 

CCACTCATCA 



TAGTATATGG 
GTTGATGCTC 



CAAGCCAGCT 



ACAATTTTAA 

AGTTAGAAGT- 

C 

AAAAAAGAAA 
AGGAGCTCAA 



TTTTGGTCAA 
CTTACTAGAC 
AACCCCTCAG 

ATCTACACGT 

GA AAGTO CAA 

CTTGTTT 

CATCACCATA 



ATTTACACCA 
OCAGTGATTA 



CATGCAAAAT 
CATTTCAGTC 
C 

TAAAACAGCC 
GTGGGCAGGG 

TAGCTTTAAG 
TATATCCTCA 
TCTTTGGTAC 
TRAAATAAAG 



AAGCTAGAGG 

TTAACCTTTT 

AAAGACACAT 



TGAAGCAGAT 
AATGATCCTT 
ACTTTCCTGA 
TG 

GAGTOAGGTT 
GGTGAAAGAG 
TAGAAGATGG 



CCTTCAGCCA 
TAAAATAAGC 
TTTGCATTTC 



CTCCATGGCC 
TCTTAAGGCC 
CCC 

ATCCCAGGAT 
CTGAGGTTCT 
TTTCCAGGTC 

GCACACTCTG 
TTCTTCAGCC 
CATTAAAAAA 



TGATTGGGGT 
ATAAATAGTT 



AAATGCAGAG 



AAAAAGGTGG 
OAOAGAAAGA 



AAACTGTTCA 
CAAGTCCTCT 



AGTTCTAAGC 

ATACAGTATT 

AA 

TTGTAGAATC 
ATAACAAGTC 



GCCTCGAOAC 



ACCACCCAAC 
TAGGCTTTCG 



TTACTGCTGA 
CTTACAAATA 



TGGTTTCCTA 
GCCCC 

CACCCTAGAG 
AGGTAGAAAT 
AAGTTACATG 
TATTTATCTC 



AT TGTAA ATA 
ATGTTTTGAT 
TGAGAGCITA 



GTGGTGATAC 
CCTACTTTGC 
CAGCTTGTGT 



TATTAGTCCA 

GGAGAAGAGG 

TTTAGATGAT 



OAAGACTGAC 

TTCAT CTCOG 

TCTTCCTAAA 



ACCGCCACCA 
CAGGGAGTCT 



TTTCCCTCCT 

CAGTACAATG 

TTT 

GCCTCAGCTT 
T TOAAT CAAA 
TATTTTGAAA 



TTCTATTTAT ' 
CATATCTACT 



GGTACAGAGA 



AAAAGTTACG 
AGAGGGAGGC 



AAATAAAATG 
CCCAAGAAAG 



TTAATCACAT 
AAACTGGACT 



CTACTGTATA 
AAGGTTAGAT 



GTCATTTCTG 



TATCTATAAA 
CTCTAAGATA 



AGCAGTTAAT 
ACAAAGCAAT 



AAACAATACA 



GACTAGGGTA 
GTCTATGTTT 
ATAAAAAGAA 
CTGTCTACAG 



T C ' 1 ' 1 ' 1 fA T GT 

TTGCTTTAAA 

GAGGATAGTC 



TGAOATTGTC 
TTCTCTCCAC 
ACTCTTAGGC 



CTTCTTGCAG 

AAGGGCGAAG 

AACCACAGGT 



AAAGTCATCC 
GCTGTGCCTT 
TTTCATT 
PARTIAL SEQUENCES OF 32 NOVEL OC-SPECIFIC OR -RELATED 
EXPRESSED GENES (cDNA CLONES) 



34C (SEQ ID NO: 21) 
1 CGGAGOOTAO 
61 COGCCCOCAC 
47C (SEQ ID NO: 22) 
I TTAGTTCAGT 
61 GTGCCACCTG 
121 GGAGCTGACC 
65C (SEQ ID NO: 23) 
1 GCTGAATGTT 
61 TGCAAGTGTG 
121 AACTGCCOOT 
79C (SEQ ID NO: 24) 
1 GCCAGTCGGA 
61 AOAAAACTGG 
121 CATTGCCAAC 
UC (SEQ ID NO: 25) 
1 GCCAGGGCGO 
61 GACCTGCAGT 
121 OGTGOCTGAG 
86C (SEQ ID NO: 26) 
1 AACTCTTTCA 
61 GTTCATATCA 
121 TTCAATTATA 
87C (SEQ ID NO: 27) 
1 GGATAAGAAA 
61 CGCAGCAGCC 
121 GTCCTGGTTG 
8SC (SEQ ID NO: 28) 
I CTGACCTTCG 
61 TGTTCAACGG 
89C (SEQ ID NO 29) 
1 ATCCCTGGCT 
61 TC OCTGA GTT 
121 TCGTTTTCTG 
101 C (SEQ ID NO: 30) 
1 GCCTGGGCAT 
61 GTGCCAGCCC 
121 CGTTAGCTTT 
112C (SEQ ID NO: 31) 
1 CCAACTCCTA 
161 CAATACTCTC 
114C(SEQIDNa 32) 
1 CATGGATGAA 



GTGTOTTTAT 
CCATCACCCC 

CAAAGCAGGC 
GGGAGGTTTC 
CAGAOTGGA 

TAAGAGAGAT 
AATTACGTGG 
TTAGAGTCCT 

TATGGAATCC 

GGAAACAAAG 

CTGGCCAGCT 



ACOGTCTTTA 
GGGOGCTAGT 
TAGAACTTGT 

CACTCTGGTA 

ATTCATATTG 

AGAATATATC 

GAAGGCCTGA 
CGCACAGGTT 
GCCGGTGGAG 

AGAGTTTGAC 
AGCOGTGAGC 

GTGGATAGTC 
TGOGAGTGTG 
GTOATGTTGT 

CCCTCTCCTC 

GGCTCTGAAO 

CCCATAAGGT 

COGCGATACA 
CTAAAATAAA 

TGTCTCATGG 



TCCrGTACAA 
AGTGCAATGG 

AACOCCCTTT 
CCCAACACCC 



TTTGGTCTTA 
TATGGATGGT 
CTTAATATTG 

AGAAGGGAAA 

GATATATCCT 

TCCCCAAGAT 



TTCCTCT C CT 
CATCTGTGGC 
TCTGGAATTC 

TI 11 1A GTT T 

AGCTGTCTCA 

CTAATACTTT 

GGCCTAGGGG 

GAGAGGGGCA 

AGCCACAAAA 

CTOGAGCCGG 
GACGACTCCG 

CTTTTGTGTA 

CAAGTACTAC 

GCTAACAATA 

CTCCATCCCC 

CCAAGGGCCG 

TGGAGTATCT 

GACCCACAOA 
CATGAAGCAC 

TGGGAAGGAA 



ATCATTACAA 
CTAGCTGCTG 

GGCACTGCTG 
TOCTCTGCTT 



AAGGCTTCAT 

TGCTTGTTTA 

ATGTOCTAAC 

CAAGCACTGG 
CATGGCTCGA 
GTGACTCCAG 

GCCTCAGAGG 
AGCGAAGGTG 
C 

A ACAATATAT 
TTC7TTTTTT 
TTAAAA 

CGGRGGCTGG 
CrrCCTCTTG 



ATACCTACTG 
CTGGGGAAGT 

GCAAATGCTC 
TTAACTGTCT 
AGAATAC 

ATACATCACC 

TCCGTGCCAC 

GC 

GTGCCATCCC 



CATCGTACAT 



A ACCAA GTCT 
GCCTTT 

CCACTGOOGT 
CCCTGTCTGT 



CATOAAAGTO 
TTAACTAAAG 
ACTGGGTCTG 

ATAATTAAAA 
AATAAGAACA 

CCAGAAA 

TCAGGAAGGA 
AAGGGACTCA 



GTGTTGTGTC 
AATGGTCATA 



CCIOC CTCT C 
CTTAGGTTGG 



CCGCTATGAC 
TCTGCGGCGA 

OCTCCTTAAG 
GTCCTGCTTG 



AGGTCTAATG 
GGTGGCTGTG 



TGAGAGACCA 
TTC 



GGGGCAGTCA 



CATGGCGGTT 
CGGGGTCTCA 



TACATGCATA 
ATGTACAGCA 

CTTATGC 

ACAGCTGGGG 
ACGCCTGTGG 



GGTCTGGCAG 
CCTTGTCGCC 



TTGGAAATTA 
TACAGTAGTA 



AGTCCTGGGA 
TGAGGATCTG 



TCGGTCAGCG 
T 

GTTATAGGGC 
GCTGTOGTTA 



TTTACAAACG 
AGTATTCCTC 



GACCGCTCCC 



ᵃRepeated 3 times 
ᵇRepeated 2 times 



Sequence analysis of the OC+ stromal cell- cloned DNA
sequences revealed, in addition to the novel sequences, a
number of previously-described genes. The known genes
identified (including type 5 acid phosphatase, gelatinase B,
cystatin C (13 clones), Alu repeat sequences (11 clones),
creatine kinase (6 clones) and others) are summarized in
Table II. In situ hybridization (described below) directly
demonstrated that gelatinase B mRNA is expressed in
multinucleated osteoclasts and not in stromal cells. Although
gelatinase B is a well-characterized protease, its expression
at high levels in osteoclasts has not been previously
described. The expression in osteoclasts of cystatin C, a
cysteine protease inhibitor, is also unexpected. This finding
has not yet been confirmed by in situ hybridization. Taken
together, these results demonstrate that most of these
identified genes are osteoclast-expressed, thereby confirming the
effectiveness of the differential screening strategy for
identifying DNA encoding osteoclast-specific or -related gene
products. Therefore, novel genes identified by this method
have a high probability of being OC-specific or -related.

In addition, a minority of the genes identified by this
screen are probably not expressed by OCs (Table II). For
example, type III collagen (6 clones), collagen type I (1
clone), dermatan sulfate (1 clone), and type VI collagen (1
clone) are more likely to originate from the stromal cells or
from osteoblastic cells which are present in the tumor. These
cDNA sequences survive the differential screening process
either because the cells which produce them in the tumor in
vivo die out during the stromal cell propagation phase, or
because they stop producing their product in vitro. These
clones do not constitute more than 5-10% of all the
sequences selected by differential hybridization.

TABLE II

SEQUENCE ANALYSIS OF CLONES ENCODING KNOWN
SEQUENCES FROM AN OSTEOCLASTOMA cDNA LIBRARY

Clones with Sequence Homology to                          Number
------------------------------------------------------------------
Collagenase Type IV                                       25 total
Type 5 Tartrate Resistant Acid Phosphatase                14 total
Cystatin C                                                13 total
Alu-repeat Sequences                                      11 total
Creatine Kinase                                            6 total
Type III Collagen                                          6 total
MHC Class I γ Invariant Chain                              5 total
MHC Class II β Chain                                       3 total
One or Two Clone(s) with Sequence Homology to Each
of the Following:                                         10 total
    α1 collagen type I
    γ interferon inducible protein
    osteopontin
    Human chondroitin/dermatan sulfate
    α globin
    β glucosidase/sphingolipid activator
    Human CAPL protein (Ca binding)
    Human EST 01024
    Type VI collagen
    Human EST 00553
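
As a rough consistency check on the statement that clones probably not expressed by OCs make up no more than 5-10% of the selected sequences, the counts reported in Table II and in the text can be tallied. The sketch below is illustrative only: the counts are transcribed from the table as read here, the dictionary names are hypothetical, and the denominator covers only clones with known homology, so the share among all selected sequences (which also include the novel clones) can only be lower.

```python
# Clone counts transcribed from Table II (clones with known homology only).
# The transcription itself is an assumption of this sketch.
table_ii_counts = {
    "collagenase type IV": 25,
    "type 5 tartrate resistant acid phosphatase": 14,
    "cystatin C": 13,
    "Alu-repeat sequences": 11,
    "creatine kinase": 6,
    "type III collagen": 6,
    "MHC class I chain": 5,
    "MHC class II chain": 3,
    "one or two clones each (ten entries)": 10,
}

# Clones the text attributes to stromal or osteoblastic cells.
probably_not_oc = {
    "type III collagen": 6,
    "collagen type I": 1,
    "dermatan sulfate": 1,
    "type VI collagen": 1,
}

total_known = sum(table_ii_counts.values())
non_oc = sum(probably_not_oc.values())
share_of_known = non_oc / total_known

print(f"{non_oc} of {total_known} known clones = {share_of_known:.1%}")
```

Because the novel clones enlarge the denominator, the figure computed here acts as an upper bound on the share among all sequences selected by differential hybridization.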

Example 5 - In situ Hybridization of OC-Expressed Genes

In situ hybridization was performed using probes derived
from novel cloned sequences in order to determine whether
the novel putative OC-specific or -related genes are
differentially expressed in osteoclasts (and not expressed in the
stromal cells) of human giant cell tumors. Initially, in situ
hybridization was performed using antisense (positive) and
sense (negative control) cRNA probes against human type
IV collagenase/gelatinase B labelled with 35S-UTP.

A thin section of human giant cell tumor reacted with the
antisense probe resulted in intense labelling of all OCs, as
indicated by the deposition of silver grains over these cells,
but failed to label the stromal cell elements. In contrast, only
minimal background labelling was observed with the sense
(negative control) probe. This result confirmed that
gelatinase B is expressed in human OCs.

In situ hybridization was then carried out using cRNA 
probes derived from 11/32 novel genes, labelled with 
digoxigenin UTP according to known methods. 

The results of this analysis are summarized in Table III.
Clones 28B, 118B, 140B, 198B, and 212B all gave positive
reactions with OCs in frozen sections of a giant cell tumor,
as did the positive control gelatinase B. These novel clones
therefore are expressed in OCs and fulfill all criteria for
OC-relatedness. 198B is repeated three times, indicating
relatively high expression. Clones 4B, 37B, 88C and 98B
produced positive reactions with the tumor tissue; however,
the signal was not well-localized to OCs. These clones are
therefore not likely to be useful and are eliminated from
further consideration. Clones 86B and 87B failed to give a
positive reaction with any cell type, possibly indicating very
low level expression. This group of clones could still be
useful but may be difficult to study further. The results of this
analysis show that 5/11 novel genes are expressed in OCs,
indicating that ~50% of novel sequences are likely to be
OC-related.
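
The proportion quoted above follows directly from the three outcome groups described in the text. The short sketch below tallies them; the outcome labels and variable names are hypothetical conveniences, and the clone assignments are taken from the preceding paragraph rather than from any additional data.

```python
from collections import Counter

# Outcome of in situ hybridization for each of the 11 novel clones,
# as described in the text (labels are illustrative, not from the source).
in_situ_outcomes = {
    "28B": "osteoclast-localized",
    "118B": "osteoclast-localized",
    "140B": "osteoclast-localized",
    "198B": "osteoclast-localized",
    "212B": "osteoclast-localized",
    "4B": "not localized to osteoclasts",
    "37B": "not localized to osteoclasts",
    "88C": "not localized to osteoclasts",
    "98B": "not localized to osteoclasts",
    "86B": "no signal",
    "87B": "no signal",
}

counts = Counter(in_situ_outcomes.values())
fraction_oc = counts["osteoclast-localized"] / len(in_situ_outcomes)

for outcome, n in counts.items():
    print(f"{outcome}: {n}")
print(f"OC-expressed fraction: {fraction_oc:.0%}")
```

Five of eleven is about 45%, which the text rounds to approximately 50%.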

To generate probes for the in situ hybridizations, cDNA
derived from novel cloned osteoclast-specific or -related
cDNA was subcloned into a BlueScript II SK(-) vector. The
orientation of cloned inserts was determined by restriction
analysis of subclones. The T7 and T3 promoters in the
BlueScript II vector were used to generate 35S-labelled
(35S-UTP, 850 Ci/mmol, Amersham, Arlington Heights, Ill.) or
UTP digoxygenin labelled cRNA probes.
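
The relationship between the sense (negative control) and antisense (positive) probes can be illustrated with a few lines of code. This is a minimal sketch, assuming the input is written 5' to 3' in the sense (mRNA-like) orientation; the function names and the ten-base example are hypothetical and are not taken from the cloned sequences. Which of the two promoters yields the antisense transcript depends on how the insert sits in the vector, which is why insert orientation has to be established first.

```python
# Watson-Crick complements for DNA bases.
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (5'->3')."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def to_rna(dna: str) -> str:
    """Write a DNA string as the equivalent RNA (T replaced by U)."""
    return dna.upper().replace("T", "U")


def probe_pair(sense_dna: str) -> dict:
    """Sense and antisense cRNA sequences for a sense-strand insert."""
    return {
        # Same sequence as the mRNA: should not hybridize to it.
        "sense": to_rna(sense_dna),
        # Complementary to the mRNA: hybridizes to the transcript.
        "antisense": to_rna(reverse_complement(sense_dna)),
    }


example = probe_pair("CCAAATTTCT")  # hypothetical ten-base insert
print(example["sense"])
print(example["antisense"])
```

Only the antisense transcript can base-pair with the cellular mRNA, so labelling seen with the sense probe reflects background.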

TABLE III

IN SITU HYBRIDIZATION USING PROBES
DERIVED FROM NOVEL SEQUENCES

                        Reactivity with:
Clone                   Osteoclasts     Stromal Cells
------------------------------------------------------
4B                      +               +
28B*                    +               -
37B                     +               +
86B                     -               -
87B                     -               -
88C                     +               +
98B                     +               +
118B*                   +               -
140B*                   +               -
198B*                   +               -
212B*                   +               -
Gelatinase B*           +               -

*OC-expressed, as indicated by reactivity with antisense probe and
lack of reactivity with sense probe on OCs only.

In situ hybridization was carried out on 7 micron cryostat
sections of a human osteoclastoma as described previously
(Chang, L.-C. et al. Cancer Res. 49:6700 (1989)). Briefly,
tissue was fixed in 4% paraformaldehyde and embedded in
OCT (Miles Inc., Kankakee, Ill.). The sections were
rehydrated, postfixed in 4% paraformaldehyde, washed, and
pretreated with 10 mM DTT, 10 mM iodoacetamide, 10 mM
N-ethylmaleimide and 0.1 triethanolamine-HCl.
Prehybridization was done with 50% deionized formamide, 10 mM
Tris-HCl, pH 7.0, 1x Denhardt's, 500 mg/ml tRNA, 80
mg/ml salmon sperm DNA, 0.3M NaCl, mM EDTA, and
100 mM DTT at 45° C. for 2 hours. Fresh hybridization
solution containing 10% dextran sulfate and 1.5 ng/ml
35S-labelled or digoxygenin labelled RNA probe was
applied after heat denaturation. Sections were coverslipped
and then incubated in a moistened chamber at 45°-50° C.
overnight. Hybridized sections were washed four times with
50% formamide, 2x SSC, containing 10 mM DTT and 0.5%
Triton X-100 at 45° C. Sections were treated with RNase A
and RNase T1 to digest single-stranded RNA, then washed four
times in 2x SSC/10 mM DTT.

In order to detect 35S-labelling by autoradiography, slides
were dehydrated, dried, and coated with Kodak NTB-2
emulsion. The duplicate slides were split, and each set was
placed in a black box with desiccant, sealed, and incubated
at 4° C. for 2 days. The slides were developed (4 minutes)
and fixed (5 minutes) using Kodak developer D19 and
Kodak fixer. Hematoxylin and eosin were used as
counterstains.

In order to detect digoxygenin-labelled probes, a Nucleic
Acid Detection Kit (Boehringer-Mannheim, Cat. #1175041)
was used. Slides were washed in Buffer 1, consisting of 100
mM Tris/150 mM NaCl, pH 7.5, for 1 minute. 100 μl of Buffer
2 (made by adding 2 mg/ml blocking reagent, as provided by
the manufacturer, to Buffer 1) was added to each slide. The
slides were placed on a shaker and gently swirled at 20° C.

Antibody solutions were diluted 1:100 with Buffer 2 (as
provided by the manufacturer). 100 μl of diluted antibody
solution was applied to the slides and the slides were then
incubated in a chamber for 1 hour at room temperature. The
slides were monitored to avoid drying. After incubation with 
antibody solution, slides were washed in Buffer 1 for 10 
minutes, then washed in Buffer 3 containing 2 mM levami- 
sole for 2 minutes. 

After washing, 100 μl color solution was added to the
slides. Color solution consisted of nitroblue tetrazolium salt
(NBT) (1:225 dilution) μl, 5-bromo-4-chloro-3-indolyl
phosphate (1:285 dilution) 3.5 μl, levamisole 0.2 mg in
Buffer 3 (as provided by the manufacturer) in a total volume
of 1 ml. Color solution was prepared immediately before
use.
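
The stated dilutions and volumes in the color solution can be cross-checked with simple arithmetic. The sketch below assumes that a "1:N dilution" means one part stock in N parts of final volume; that reading, and the function name, are assumptions of this example rather than statements from the kit instructions.

```python
def stock_volume_ul(final_volume_ul: float, dilution_factor: float) -> float:
    """Volume of stock needed so that stock/final equals 1/dilution_factor."""
    if dilution_factor <= 0:
        raise ValueError("dilution factor must be positive")
    return final_volume_ul / dilution_factor


FINAL_VOLUME_UL = 1000.0  # total volume of 1 ml, as stated in the text

bcip_ul = stock_volume_ul(FINAL_VOLUME_UL, 285)
nbt_ul = stock_volume_ul(FINAL_VOLUME_UL, 225)

print(f"1:285 in 1 ml -> {bcip_ul:.1f} ul")
print(f"1:225 in 1 ml -> {nbt_ul:.1f} ul")
```

Under this reading a 1:285 dilution in 1 ml corresponds to about 3.5 μl, which agrees with the volume given for 5-bromo-4-chloro-3-indolyl phosphate; the same arithmetic gives roughly 4.4 μl for a 1:225 dilution.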

After adding the color solution, the slides were placed in 
a dark, humidified chamber at 20° C for 2-5 hours and 
monitored for color development. The color reaction was 
stopped by rinsing slides in TB Buffer. 

The slides were stained for 60 seconds in 0.25% methyl
green, washed with tap water, then mounted with
water-based Permount (Fisher).



Example 6 - Immunohistochemistry

Immunohistochemical staining was performed on frozen
and paraffin embedded tissues as well as on cytospin
preparations (see Table IV). The following antibodies were used:
polyclonal rabbit anti-human gelatinase antibodies; Ab110
for gelatinase B; monoclonal mouse anti-human CD68
antibody (clone KP1) (DAKO, Denmark); Mo1 (anti-CD11b)
and Mo2 (anti-CD14) derived from ATCC cell lines HB
CRL 8026 and TIB 228/HB44. The anti-human gelatinase B
antibody Ab110 was raised against a synthetic peptide with
the amino acid sequence EALMYPMYRFTEGPPLHK
(SEQ ID NO: 34), which is specific for human gelatinase B
(Corcoran, M. L. et al. J. Biol. Chem. 267:515 (1992)).

Detection of the immunohistochemical staining was
achieved by using a goat anti-rabbit glucose oxidase kit
(Vector Laboratories, Burlingame, Calif.) according to the
manufacturer's directions. Briefly, the sections were
rehydrated and pretreated with either acetone or 0.1% trypsin.
Normal goat serum was used to block nonspecific binding.
Incubation with the primary antibody for 2 hours or
overnight (Ab110: 1/500 dilution) was followed by either a
glucose oxidase labeled secondary anti-rabbit serum or, in the
case of the mouse monoclonal antibodies, by reaction with
purified rabbit anti-mouse Ig before incubation with the
secondary antibody.

Paraffin embedded and frozen sections from
osteoclastomas (GCT) were reacted with a rabbit antiserum against
gelatinase B (antibody 110) (Corcoran, M. L. et al. J. Biol.
Chem. 267:515 (1992)), followed by color development with
glucose oxidase linked reagents. The osteoclasts of a giant
cell tumor were uniformly strongly positive for gelatinase B,
whereas the stromal cells were unreactive. Control sections
reacted with rabbit preimmune serum were negative.
Identical findings were obtained for all 8 long bone giant cell
tumors tested (Table IV). The osteoclasts present in three out
of four central giant cell granulomas (GCG) of the mandible
were also positive for gelatinase B expression. These
neoplasms are similar but not identical to the long bone giant
cell tumors, apart from their location in the jaws (Shafer, W.
G. et al., Textbook of Oral Pathology, W. B. Saunders
Company, Philadelphia, pp. 144-149 (1983)). In contrast,
the multinucleated cells from a peripheral giant cell tumor,
which is a generally non-resorptive tumor of oral soft tissue,
were unreactive with antibody (Shafer, W. G. et al.,
Textbook of Oral Pathology, W. B. Saunders Company,
Philadelphia, pp. 144-149 (1983)).

Antibody 110 was also utilized to assess the presence of
gelatinase B in normal bone (n=3) and in Paget's disease, in
which there is elevated bone remodeling and increased
osteoclastic activity. Strong staining for gelatinase B was
observed in osteoclasts both in normal bone (mandible of a
2 year old), and in Paget's disease. Staining was again absent
in controls incubated with preimmune serum. Osteoblasts
did not stain in any of the tissue sections, indicating that
gelatinase B expression is limited to osteoclasts in bone.
Finally, peripheral blood monocytes were also reactive with
antibody 110 (Table IV).

TABLE IV

DISTRIBUTION OF GELATINASE B IN VARIOUS TISSUES

Samples                         Antibodies tested:
                                Ab 110 (gelatinase B)

GCT frozen (n = 2)
    giant cells
    stromal cells
GCT paraffin (n = 6)
    giant cells
    stromal cells
central GCG (n = 4)
    giant cells
    stromal cells
peripheral GCT (n = 4)
    giant cells
    stromal cells
Paget's disease (n = 1)
    osteoclasts
    osteoblasts
normal bone (n = 3)
    osteoclasts
    osteoblasts
monocytes (cytospin)

+(*)

+
+

Distribution of gelatinase B in multinucleated giant cells, osteoclasts,
osteoblasts and stromal cells in various tissues. In general, paraffin embedded
tissues were used for these experiments; exceptions are indicated.

Equivalents 

Those skilled in the art will recognize, or be able to 
ascertain using no more than routine experimentation, many 
equivalents to the specific embodiments described herein. 
Such equivalents are intended to be encompassed by the 
following claims. 



SEQUENCE LISTING

( 1 ) GENERAL INFORMATION:

( iii ) NUMBER OF SEQUENCES: 34



( 2 ) INFORMATION FOR SEQ ID NO:1:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 170 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:1:
GCAAATATCT AAGTTTATTG CTTOO ATTTC TACTGACAGC TGTTCAATTT GGTC ATCTC A 60 
AATOTTTCTA GOOTTT TTTT AOTTTCTTTT TATTGAAAAA TTTAA TTATT T A TG C T A TAG 120 
GTGATATTCT CTTTCA ATAA ACCTATAATA GAAAATAGCA GCAGACAACA 170 

( 2 ) INFORMATION FOR SEQ ID NO:2:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 63 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:2:

GTCTCAACCT GCATATCCTA A A A A T OTCA A AATGCTOCAT C TGOTT A A TG TCGGGGTAOG 60 
GOG 63

( 2 ) INFORMATION FOR SEQ ID NO:3:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 163 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:3:
CTTCCCTCTC TTGCTTCCCT TTCCCAACCA GACGTGCTCA CTCCATGGCC ACCGCCACCA 60 
CAGGCCCACA GGG AG T ACTG CCAGACT ACT GCTGATG TTC TCTTAACGCC CAOOOAOTCT 120 
CAACCAGCTG GTGGTG AATG CTGCCTCGCA CGGGACCCCC CCC 163 

( 2 ) INFORMATION FOR SEQ ID NO:4:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 173 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:4:
TTTTATTTGT AAATATATGT ATT AC A TCC C TAGAAAAAGA ATCCCAGGAT TTTCCCTCCT 60 
GTGTCTTTTC GTCTTOCTTC TTCATGGTCC ATGATGCCAG CTG AGGTTOT C AG T A C A A TG 120 
AAACCAAACT CGCCCCATGO AAGCAGATTA TTCTOCCATT TTTCCAGGTC TTT 173 

( 2 ) INFORMATION FOR SEQ ID NO:5:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 197 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:5:
COCTGOACAT GOCTGCCCTC CACCTCCCTC A.TATCCCC AG GCACACTCTC CCCTCACCTT 60 
TTGCCCTGGC CATOTCATCT ACCTCCAGTG OOCCCTCCCC TTCTTCAGCC T T O A A T C A A A 120 
AGCCACTTTG TTAGGCGAGG ATTTCCCACA CCACTCATCA CATTAAAAAA TATTTTOAAA 110 
A C A A A A A A A A AA A AAA A 1'7 

( 2 ) INFORMATION FOR SEQ ID NO:6:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 132 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:6:
TTCACAAAGC TGTTTATTTC C A CCA ATA A A TAGTATATGG TCATTCGGGT TTCTATTTAT 60 
AAGAGTAOTG GCTATTATAT OOOGTATCAT OTTGATGCTC ATAAATAGTT CATATCTACT 120 
TAATTTGCCT TC 1*2 

( 2 ) INFORMATION FOR SEQ ID NO:7:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 75 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:7:
CAAGAGAGTT GTATGTACAA CCCCAACAGG CAAGGCAGCT AAATOCAOAO GGTACAGAGA 60 
GATCCCGAGG GAATT ?5 

( 2 ) INFORMATION FOR SEQ ID NO:8:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 151 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:8:
GOATGCAAAC ATOTAOAAOT CCAGAGAAAA ACAATTTTAA AAAAAOGTGG AAAAGTTACG 60 
GC A A ACCTG A GATTTCAGCA TAAAATCTTT AGTTAGAAGT GAGAGAAAGA AGAGGGAGGC 1J0 
TOOTTGC TOT TGCACGTATC A AT AGGTT A T C 15 1 

( 2 ) INFORMATION FOR SEQ ID NO:9:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 141 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:9:

TTCTTOATCT TTACAACACT ATGAATAOOG AAAAAACAAA AAACTCTTCA AAATAAAATC . 60 

TACOAOCCGT OCTTTTGGAA TGCTTO A GTO AGGAGCTCAA C A AGTCCTCT CCCAAGAAAG 120 
CAATGATAAA ACTTOACAAA A 141 

( 2 ) INFORMATION FOR SEQ ID NO:10:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 162 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:10:
ACCCATTTCT A A C A ATTTTT ACTGTAAAAT TTTTGOTCA A AGTTCTAAGC TTAATCACAT 60 
CTCAA AG A A T AGAGGCAATA TAT AGCCC A T CTT ACTAOAC ATACAOTATT A AACTGGACT 120 
G A A T A T 0 AG 0 ACAAGCTCTA GTGGTCATTA AACCCCTCAG AA 162 

( 2 ) INFORMATION FOR SEQ ID NO:11:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 157 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:11:
ACATATATTA ACAGCATTCA TTTGCCCAAA ATCTACACGT TTGTAGAATC CTACTOTATA 60 
T A A AGTGGG A ATGTATCAAG TATAGACTAT GAAAGTGCAA ATAACAAOTC A AOGTTAGAT 120 
TAACTTTTTT TTTTTACATT ATA A A ATT AA CTTGTTT 157 

( 2 ) INFORMATION FOR SEQ ID NO:12:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 75 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:12:
CCAAATTTCT CTGGAATCCA TCCTCCCTCC CATCACCATA GCCTCGAGAC GTC ATTTCTG 60 
TTTGACTACT CCAGC 75 
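
Each entry in the listing states a LENGTH that should equal the number of bases printed in the sequence description. A minimal sketch of that check is shown below, using the two lines of SEQ ID NO:12 as printed above; the function name is hypothetical, and the check counts letters only, so it assumes that any character substitutions introduced in reproduction preserve the number of bases.

```python
def count_bases(sequence_lines: list[str]) -> int:
    """Count residue letters, ignoring spaces and position numbers."""
    return sum(ch.isalpha() for line in sequence_lines for ch in line)


# SEQ ID NO:12 as printed, including the running position numbers.
seq_id_no_12 = [
    "CCAAATTTCT CTGGAATCCA TCCTCCCTCC CATCACCATA GCCTCGAGAC GTC ATTTCTG 60",
    "TTTGACTACT CCAGC 75",
]

stated_length = 75  # from the LENGTH field of the entry
counted_length = count_bases(seq_id_no_12)

print(counted_length, counted_length == stated_length)
```

A mismatch between the counted and stated lengths would point to a dropped or duplicated group of bases in that entry.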

( 2 ) INFORMATION FOR SEQ ID NO:13:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 124 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:13:
AACTAACCTC CTCOGACCCC TGCCTCACTC ATTTA C A CC A ACCACCCAAC T A TC T A T A A A 60 
CCTGAGCCAT GGCCATCCCT TATG AGCGGC GC AGTO ATT A TAOGCTTTCG CTCTAAOATA 120 
AA AT 124

( 2 ) INFORMATION FOR SEQ ID NO:14:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 131 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:14:
ATTATTATTC TTTTTTTATC TTAOCTTAGC CATCCAAAAT TTACTOGTOA AGCAOTTAAT 60 
AAA A C AC ACA TCCCATTGAA CCCTTTTGTA CATTTCAGTC CTTAC A A AT A A C A A AO C A A T 120 
GATAAACCCO GCACGTCCTC ATACGAAATT C 13 1 

( 2 ) INFORMATION FOR SEQ ID NO:15:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 103 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:15:
CGTGACACAA AC A TG CATTC OTTTTATTCA T A A A AC AGCC TGGTTTCC T A A A A C A AT A C A 60 
A AC ACCATGT TCATCAOCAO OAAOCTGGCC GTGGGCAGGG OOOCC 109 

( 2 ) INFORMATION FOR SEQ ID NO:16:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 246 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:16:



TTCTCATTCA COOOACTAGT TACCTTTAAC CACCCTACAG GACTAGGGTA 60
TCACTTCCTA AGTTCCCTCT TATATCCTCA AGGTAGAAAT GTCTATOTTT 120
TTCATAAATC TATTCATAAG TCTTTGGTAC AAOTTACATG ATAAAAAGAA 180
TCTTCCCTTC TTTGCACTTT TGAAATAAAG TATTTATCTC CTOTCTACAO 240
TTTAAT 246

( 2 ) INFORMATION FOR SEQ ID NO:17:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 188 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:17:
G TC C AGT AT A A AGG A A AGCC TTAAGTCOGT AAGCTACACC ATTGTA A ATA T C TTTTA TCT 60 
CCTCTAOATA AAACACCCOA TT A AC AG ATG TTAACCTTTT ATGTTTTG A T TTGCTTTA AA 120 
AATGGCCTTC T AC A C A TTAG CTCCAGCTAA AAAGACACAT TG AOAGCTT A G AGOATAGTC 180
TCTCCACC 188



( 2 ) INFORMATION FOR SEQ ID NO:18:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 212 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:18:

CCACTTCCAA OCCAGTTOOT GTOCTATTTT TOAAOCACAT OTGOTOATAC TOAOATTOTC 60 

TCTTCAGTTT CCCCATTTG T TTGTG CTTC A AATGATCCTT CCT ACTTTGC TTCTCTCCAC 120 

CCATOACCTT TTTC ACTGTG OCCATCAAOG ACTTTCCTG A C AGCTTGTGT ACTCTTAGGC 180 

TAAGAGATGT GACTACAGCC TGCCCCTOAC TO 212 

( 2 ) INFORMATION FOR SEQ ID NO:19:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 203 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:19:

TGTTAG TTTT TAGOAAGGCC TOTCTTCTGG CAGTGACGTT TATTAGTCCA CTTCTTOGAG 60 

CTAGACGTCC T AT AGTT AGT C AC TGGGG AT GGTG AAAG AG GGAGAAGAGG AAGGGCGAAG 130 

GGAAGGGCTC TTTGCTAGTA TCTCC ATTTC T AG A AG A TGG TTTAGATGAT AACCACAGGT ISO 

CTATATGAGC ATAOTAAGGC TGT 203 

( 2 ) INFORMATION FOR SEQ ID NO:20:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 177 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:20:
CCT ATTTCTG ATCCTCACTT TGGACAAGGC CCTTCAGCCA G AAG AC TG A C AAAGTCATCC 60 
TCCOTCTACC AO ACCCTGC A CTTGTGATCC TA A A AT A A G C TTCATCTCCG GCTGTGCCTT 120 
GGGTGGAAGG GGCAGGATTC TCC AGCTGCT TTTOC A TTTC TCTTCCTAAA TTTCATT 17 7 

( 2 ) INFORMATION FOR SEQ ID NO:21:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 106 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:21:
CGG AG CGT AG GT GTGTTT AT TCCTGTACAA ATCATTACAA AACCAACTCT OOGOCAOTCA 60 
CCCCCCCCAC CCATCACCCC AOTGC A A TGG CTACCTGCTG GCCTTT 106 
( 2 ) INFORMATION FOR SEQ ID NO:22:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 139 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:22:
TTAOTTCAOT CAAAOCAOOC AACCCCCTTT OOCACTGCTG CCACTOOOOT CATOOCGOTT 60 
GTOOCAOCTG GGGAGGTTTC CCCAACACCC TCCTCTGCTT CCCTGTCTGT CGOGGTCTCA 120 
GOAOCTOACC C AG A GTGOA 139 

( 2 ) INFORMATION FOR SEQ ID NO:23:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 177 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:23:
GCTOAATOTT TAAGAGAGAT TTTGGTCTTA AAGGCTTCAT CATOAAAGTC TACATGCATA 60 
TGCAAOTGTG AATTACGTGG TATGGATGGT TGCTTOTTTA TTAACTAAAC ATCTACAGCA 120 
AACTGCCCGT TTAOAGTCCT CTTAATATTG ATG TCCT A AC ACTGGGTCTG CTTATGC 177 

( 2 ) INFORMATION FOR SEQ ID NO:24:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 167 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:24:
CGCAGTGGGA TATOOAATCC AGAAGOGAAA CAAGCACTGG ATAATTAAAA ACAOCTGGGG 60 
AGAAAACTGG OGAAACAAAG CATATATCCT C A TGGCTCG A AATAAGAACA ACGCCTGTGC 130 
CATTGCCAAC CTOOCCAOCT TCCCCAAGAT GTGACTCCAG CCAGAAA 167 

( 2 ) INFORMATION FOR SEQ ID NO:25:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 151 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:25:
GCC AGGGCGG ACCOTCTTTA TTCCTCTCCT GCCTCAGAGG TCAGGAAGGA GCTCTGGCAG 60 
GACCTGCAGT GGGCCCTAGT CATCTOTGGC AGCG A AGOTG AAOGOACTCA CCTTGTCGCC 120 
COTGCCTGAO TAOAACTTOT TCTGOAATTC C 151 



( 2 ) INFORMATION FOR SEQ ID NO:26:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 156 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:26:
AACTCTTTCA CACTCTOCTA TTTTTAOTTT AACAATATAT CTCTTOTGTC TTGGAAATTA 60 
OTTCATATCA ATT C AT ATTG AC CTGTCTC A TTCTTTTTTT A A T G G T CAT A TACAGTAOTA 120 
TTCAATTATA AG A AT ATATC CTAATACTTT TTAAAA 136 

( 2 ) INFORMATION FOR SEQ ID NO:27:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 150 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:27:
GGAT A AGAAA G A AGGCCTG A OCOCTAGGGG CCCGGGCTGG CCTGCGTCTC AGTCCTGGGA 60 
CGCAGCAGCC CGC AC AGGTT GAG A GGGGC A CTTC CTCTTG CTTAGGTTGG TCACGATCTG 120 
GTCCTGGTTG GCCGGTGGAG AGCCACAAAA '*0 

( 2 ) INFORMATION FOR SEQ ID NO:28:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 212 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:28:

GC A CTTGG A A GGG AGTTGGT OTGCTATTTT TCAAGCAOAT GTGGTG ATAC TGAGATTGTC 60 

TGTTC AGTTT CCCCATTTGT T TO TG CTTC A A A TGATCCT T CCT ACTTTGC TTCTCTCCAC 120 

CCATGACCTT TTTC ACTGTG GCC ATC A AGO AC TTTC C T G A C AGCTTGTGT AC TCTTAGGC 1X0 

TAAGAGATGT OACTACAGCC TCCCCCTOAC TG 2 12 

( 2 ) INFORMATION FOR SEQ ID NO:29:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 157 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:29:
ATC CC TG GC T GTGGAT AOTO CTTTTGTOTA CCAAATGCTC CCTCCTTAAG GTTAT AGGGC 60 
TCCCTCAOTT TGGGAGTGTG CAAOTACTAC TTAACTGTCT OTCCTGCTTG GC.TGTCGTTA 120 
TCGTTTTCTO GTOATGTTOT OCTAACAATA AGAATAC 157 

( 2 ) INFORMATION FOR SEQ ID NO:30:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 152 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:30:
CGCTGGOC AT CCCTCTCCTC CTCCATCCCC ATA CAT C A C C AGGTCTAATG TTTACAAACG 60 
GTOCCACCCC GO C TCTGAAG CCAAOGOCCO TCCOTOCCAC OOTOOCTOTO AGTATTCCTC 120 
CGTTAOCTTT CCC ATAAGGT TGGAGTATCT CC 152 

( 2 ) INFORMATION FOR SEQ ID NO:31:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 90 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:31:
CCAACTCCTA CCGCGATACA G ACCC A C AG A GTGCCATCCC TC AG AG AC C A GACCGCTCCC 60 
C A A TACTCTC CTA A AATAA A CATGAAOCAC 90 

( 2 ) INFORMATION FOR SEQ ID NO:32:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 43 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:32:

CATGGATGAA TGTCTCATGG TGGG A A GGA A CATOOTACAT T TC 4) 

( 2 ) INFORMATION FOR SEQ ID NO:33:

( i ) SEQUENCE CHARACTERISTICS:
( A ) LENGTH: 2334 base pairs
( B ) TYPE: nucleic acid
( C ) STRANDEDNESS: double
( D ) TOPOLOGY: linear

( ii ) MOLECULE TYPE: DNA (genomic)

( xi ) SEQUENCE DESCRIPTION: SEQ ID NO:33:

AGACACCTCT GCCCTCACCA TGACCCTCTO GCAGCCCCTC GTCCTOOTOC TCCTGGTOCT 60
OGGCTGCTGC TTTGCTGCCC CCAGACAGCO CCAGTCCACC CTTOTGCTCT TCCCTGGACA 120
CCTGACAACC AATCTCACCG ACAGGCAGCT GGCAGaOOAA TACCTOTACC OCTATOOTTA 180
CACTCGGGTC CCAGAGATGC CTOOAOAOTC GAAATCTCTG GGGCCTCCGC TGCTGCTTCT 240
CCAGAAGCAA CTOTCCCTOC CCOAGACCGO TOAOCTOOAT AOCGCCACGC TGAAGGCCAT 300
GCGAACCCCA CGGTGCGGGG TCCCAGACCT GGGCAGATTC CAAACCTTTG AGGGCGACCT 360
CAAOTCOCAC CACCACAACA TCACCTATTG OATCCAAAAC TACTCGGAAG ACTTGCCOCO 420
GGCGGTGATT GACGACGCCT TTGCCCGCGC CTTCCCACTC TGGAGCGCGG TGACCCCGCT 480
CACCTTCACT COCGTGTACA GCCGGOACOC AGACATCGTC ATCCAGTTTG OTOTCGCGGA 540
GCACGGAGAC GGGTATCCCT TCGACGGGAA GOACGGGCTC CTGGCACACG CCTTTCCTCC 600
TGGCCCCGGC ATTCAGGGAG ACGCCCATTT CGACGATGAC GAGTTGTGOT CCCTGGGCAA 660
GOCCOTCGTC CTTCCAACTC OGTTTGGAAA CGCAGATGGC GCGGCCTGCC ACTTCCCCTT 720
CATCTTCCAO GCCCGCTCCT ACTCTGCCTG CACCACCGAC GGTCGCTCCG ACGGGTTGCC 780
CTCGTGCAOT ACCACGGCCA ACTACGACAC CGACGACCGQ TTTGCCTTCT GCCCCAOCOA 840
CAGACTCTAC ACCCOOOACG OCAATGCTOA TOOOAAACCC TGCCAOTTTC CATTCATCTT 900
CCAAGGCCAA TCCTACTCCG CCTOCACCAC GGACGGTCGC TCCGACGCCT ACCOCTGGTG 960
CGCCACCACC GCCAACTACG ACCGGOACAA GCTCTTCGGC TTCTCCCCGA CCCCAGCTGA 1020
CTCCACOOTO ATGGOOOOCA ACTCGGCGGG GGAGCTGTGC GTCTTCCCCT TCACTTTCCT 1080
GGGTAAGGAG TACTCGACCT GTACCAGCGA GGGCCGCGGA GATGGGCGCC TCTGOTGCGC 1140
TACCACCTCO AACTTTGACA GCGACAAGAA OTGCGOCTTC TGCCCGOACC AAOGATACAG 1200
TTTGTTCCTC GTGGCOOCGC ATGAGTTCGG .CCACGCOCTG GCCTTAOATC ATTCCTCACT 1260
OCCGGAGGCG CTCATGTACC CTATCTACCG CTTCACTGAG GGGCCCCCCT TGCATAAGGA 1320
CGACGTOAAT OOCATCCGGC ACCTCTATGG TCCTCGCCCT GAACCTGAOC CACOGCCTCC 1380
AACCACCACC ACACCGCAGC CCACGGCTCC CCCGACGGTC TGCCCCACCG GACCCCCCAC 1440
TGTCCACCCC TCAGAGCCCC CCACAGCTGG CCCCACAGGT CCCCCCTCAC CTOOCCCCAC 1500
AGGTCCCCCC ACTOCTGGCC CTTCTACGGC CACTACTGTG CCTTTGAGTC CGGTGGACGA 1560
TGCCTGCAAC GTGAACATCT TCGACGCCAT CGCGGAGATT GGGAACCAGC TGTATTTGTT 1620
CAAGCATCCG AAGTACTGGC CATTCTCTOA G0GCAOGGG0 AGCCGGCCGC AGCGCCCCTT 1680
CCTTATCOCC GACAAGTGGC CCGCCCTGCC CCCCAAGCTC GACTCGGTCT TTGAGGAGCC 1740
GCTCTCCAAG AAGCTTTTCT TCTTCTCTGG GCGCCAGGTG TGCCTCTACA CAGGCGCGTC 1800
GGTOCTQOGC TC.C.Af*AAC.CT uCCCACuTuu CCCAGGTGAC 1860
CGGGGCCCTC CGGAGTGGCA GGGGGAAGAT CCTGCTGTTC AGCGGGCGGC GCCTCTGOAG 1920
GTTCOACCTG AAGGCGCAGA TGGTGGATCC CCGGAGCOCC AOCGAOGTGG ACCGGATGTT 1980
CCCCGGGGTG CCTTTGOACA CGCACGACGT CTTCCAGTAC CGAGAGAAAG CCTATTTCTG 2040
CCAGGACCGC TTCTACTGGC GCGTGAGTTC CCGGAGTCAG TTGAACCAGG TOOACCAAGT 2100
GGGCTACGTG ACCTATOACA TCCTGCAGTG CCCTGAGGAC TAGOOCTCCC GTCCTGCTTT 2160
GCAGTGCCAT GTAAATCCCC ACTGGGACCA ACCCTGGGGA AGGAGCCAGT TTGCCGGATA 2220
CAAACTGGTA TTCTGTTCTG GAGGAAAGGG AGGAGTOGAG GTGGGCTGGG CCCTCTCTTC 2280
TCACCTTTOT TTTTTGTTGG AGTGTTTCTA ATAAACTTGG ATTCTCTAAC CTTT 2334



(2) INFORMATION FOR SEQ ID NO:34:

(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 18 amino acids
(B) TYPE: amino acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: unknown

(ii) MOLECULE TYPE: peptide

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:34:

Glu Ala Leu Met Tyr Pro Met Tyr Arg Phe Thr Glu Gly Pro Pro Leu

His Lys



We claim:

1. An isolated osteoclast-specific or -related DNA
sequence, or its complementary sequence, the DNA
sequence comprising a nucleic acid sequence selected from
the group consisting of:

a) DNA sequences set forth in the group consisting of
[illegible]
SEQ ID NOS. 12, 14, 16 and 17, or their complementary
strands; and



5,552,281 
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b) DNA sequences which hybridize under standard
conditions to the DNA sequences defined in a).

2. A DNA construct capable of replicating, in a host cell,
osteoclast-specific or -related DNA, said construct
comprising:

a) a DNA sequence of claim 1; and

b) sequences, in addition to said DNA sequence,
necessary for transforming or transfecting a host cell, and for
replicating, in a host cell, said DNA sequence.

3. A DNA construct capable of replicating and expressing,
in a host cell, osteoclast-specific or -related DNA, said
construct comprising:

a) a DNA sequence of claim 2; and

b) sequences, in addition to said DNA sequence,
necessary for transforming or transfecting a host cell, and for
replicating and expressing, in a host cell, said DNA
sequence.

4. A cell stably transformed or transfected with a DNA 
construct according to claim 3. 

5. A cell stably transformed or transfected with a DNA 
construct according to claim 4. 
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ABSTRACT 



The present invention provides amino acid sequences of 
peptides that are encoded by genes within the human 
genome, the kinase peptides of the present invention. The 
present invention specifically provides isolated peptide and 
nucleic acid molecules, methods of identifying orthologs 
and paralogs of the kinase peptides, and methods of
identifying modulators of the kinase peptides.

9 Claims, 41 Drawing Sheets 
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1 CCCAGGGCGC CGTAGGCGGT GCATCCCGHT CGCGCCTGGG GTTGTGGTCT 
51 TCCCGCGCCT GAGGCGGCGG CGGCAGGAGC TGAGGGGAGT TGTAGGGAAC 
101 TGAGGGGAGC TGCTGTGTCC CCCGCCTCCT CCTCCCCATT TCCGCGCTCC 
151 CGGGACCATG TCCGCGCTGG CGGGTGAAGA TGTCTGGAGG TGTCCAGGCT 
201 GTGGGGACCA CATTGCTCCA AGCCAGATAT GGTACAGGAC TGTCAACGAA 
251 ACCTGGCACG GCTCTTGCTT CCGGTGAAAG TGATGCGCAG CCTGGACCAC 

301 rrrAATGTGC TCAAGTTCAT TGGTGTGCTG TACAAGGATA AGAAGCTGAA 

351 cctgctgaS gagtacattg aggggggcac actgaaggac thctqcgca 

401 GTATGGATCC GTTCCCCTGG CAGCAGAAGG TCAGGTTTGC CAAAGGAATC 

451 GCCTCCGGAA TGGACAAGAC TGTGGTGGTG GCAGACTTTG GGCTGTCACG 
501 GCTCATAGTG GAAGAGAGGA AAAGGGCCCC CATGGAGAAG GCCACCACCA 
551 AGAAACGCAC CTTGCGCAAG AACGACCGCA AGAAGCGCTA CACGCTGGTG 
601 GGAAACCCCT ACTGGATGGC CCCTGAGATG CTGAACGGAA AGAGCTATGA 
651 TGAGACGGTG GATATCTTCT CCTTTGGGAT CGTTCTCTGT GAGATCATTC 
701 GGCAGGTGTA TGCAGATCCT GACTGCCTTC CCCGAACACT GGACTTTGGC 
751 CTCAACGTGA AGCTTTTCTG GGAGAAGTTT GTTCCCACAG ATTGTCCCCC 
801 GGCCTTCTTC C^CTCGCCG CCATCTGCTG CAGACTGGAG CCTGAGAGCA 
851 GACCAGCATT CTCGAAATTG GAGGACTCCT TTGAGGCCCT CTCCCTGTAC 
901 CTGGGGGAGC TGGGCATCCC GCTGCCTGCA GAGCTGGAGG AGTTGGACCA 
951 CACTGTGAGC ATGCAGTACG GCCTGACCCG GGACTCACCT CCCTAGCCCT 
1001 GGCCCAGCCC CCTGCAGGGG GGTGTTCTAC AGCCAGCATT ^CCTCTCT 
1051 GCCCCATTCC TGCTGTGAGC AGGGCCGTCC GGGCTTCCTG TGGATTGGCG 
1101 GAATGTTTAG AAGCAGAACA AACCATTCCT ATTACCTCCC CAGGAGGCAA 
1151 GTOGGCGCAG CACCAGGGAA ATGTATCTCC AWGGTTCTG ataaaIaaat 
1201 ACTGTCTGTA AATCCAATAC TTGCCTGAAA GCTGTGAAGA AGAAAAAAAC 
1251 CCCTGGCCTT TGGGCCAGGA GGAATCTGTT ACTCGAATCC ACCCAGGAAC 
1301 TCCCTGGCAG TGGATTGTGG GAGGCTCTTG CTTACACTAA TCAGCGTGAC 

ml SbScctk tSg^gat cccagggtca acctccctgt gaactctgaa 
1401 GTCACTAGTC CAGCTGGGTG CAGGAGGACT TCAAGTGTGT GGACGAAAGA 
1451 aagactgatg gctcaaaggg tgtcaaaaag to^tgatgc tooottc 

1501 TACTCCAGAT CCTGTCCTTC CTGGAGCAAG GTTGAGGGAG TAGSmTGA 
iwi AGACTf.r.nT AATATGTGGT GGAACAGGCC aggagttaga gaaagggctg 

CAGCCCAGGG 



1651 TGTGAGAGGA AGCCTCCACC TCATGTTTTC AAACTTAATA CTGGAGACTG 
1701 GCTGAGAACT TACGGACAAC ATCCTTTCTG TCTGAAACAA ACAGTCACAA 
1751 GCACAGGAAG AGGCTGGGGG ACTAGAAAGA GGCCCTGCCC TCWgMtt 
1801 TCAGATCTTG GCTTCTGTTA CTCATACTCG GGTGGGCTCC TTAGTCAGAT 
1851 GCCTAAAACA TTTTGCCTAA AGCTCGATGG GITCTGGAGG ACAGTGTGGC 
1901 TTGTCAGAGG CCTAGAGTCT GAGGGAGGGG AGTGGGAGTC TCAGCAATCT 
1951 CTTGGTCTTG GCTTCATGGC AACCACTGCT CACCCTTCAA CATGCCTGGT 
2001 TTAGGCAGCA GCTTGGGCTG GGAAGAGGTG GTGGCAGAGT CTCAAAGCTG 
2051 AGATGCTGAG AGAGATAGCT CCCTGAGCTG G^TCTGA CTTCTACCTC 
2101 CCATGTTTGC TCTCCCAACT CATTAGCTCC TGG3CAGCAT CCTCCTCAGC 
2151 CACATGTGCA GGTACTGGAA AACCTCCATC TTGGCTCCCA GAGCTCTAffi 
2201 AACTCTTCAT CACAACTAGA TTTGCCTCTT CTAAGTCTCT ATGAGCTTGC 
2251 ACCATATTTA ATAAATTGGG AATGGGTTTG GGGTATTAAA AAAAAAAAAA 
2301 AAAAAAAAAA AAAAAAAAAA (SEQ ID NO:1) 

FIG.1A 
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FEATURES: 

5'UTR: 1-228 

Start Codon: 229 

Stop Codon: 994 

3'UTR: 997 
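
The feature coordinates listed above are mutually consistent, which a few lines of arithmetic can confirm. The following is a minimal Python sketch, not part of the figure. It assumes 1-based, inclusive nucleotide positions and that "Stop Codon: 994" gives the first base of the stop codon; the function name `transcript_layout` is hypothetical.

```python
def transcript_layout(start_codon, stop_codon):
    """Derive region boundaries from 1-based start and stop codon positions.

    Assumption: stop_codon is the position of the first base of the stop codon.
    """
    # Bases from the first base of the start codon up to, but not including, the stop codon.
    coding_nt = stop_codon - start_codon
    if coding_nt <= 0 or coding_nt % 3 != 0:
        raise ValueError("start and stop codons are not in the same reading frame")
    return {
        "five_prime_utr": (1, start_codon - 1),
        "coding_codons": coding_nt // 3,          # encoded residues, stop codon excluded
        "three_prime_utr_start": stop_codon + 3,  # first base after the stop codon
    }


layout = transcript_layout(229, 994)
print(layout)
```

With the values listed here, the sketch gives a 5'UTR of 1-228, 255 coding codons (the length of the peptide shown as SEQ ID NO:2), and a 3'UTR beginning at position 997.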



Homologous proteins: 

Top 10 BLAST Hits 



                                                                      Score   E
CRA 1000682328847   /altid=gi|8051618 /def=ref|NP_057952.1| LIM d.      485   e-136
CRA 18000005015874  /altid=gi|5031869 /def=ref|NP_005560.1| LIM .       485   e-136
CRA 88000001156379  /altid=gi|7434382 /def=pir X5814 LIM motif.         469   e-131
CRA 88000001156378  /altid=gi|7434381 /def=pir X5813 LIM motif.         469   e-131
CRA 18000005154371  /altid=gi|7428032 /def=pir JE0240 LIM kinas.        469   e-131
CRA 18000005126937  /altid=gi|6754550 /def=ref|NP_034848.1| LIM         469   e-131
CRA 18000005127186  /altid=gi|2804562 /def=dbj BAA24491.1 (AB00.        469   e-131
CRA 18000005127185  /altid=gi|2804553 /def=dbj BAA24489.1 (AB00.        469   e-131
CRA 18000005004416  /altid=gi|2143830 /def=pir 178847 LIM motif         468   e-131
CRA 18000005004415  /altid=gi|1708825 /def=sp|P53670|LIK2_RAT LI.       468   e-131

BLAST dbEST hits:
                                                 Score   E
gi|10950740  /dataset=dbest /taxon=96..           1049   0.0
gi|10156485  /dataset=dbest /taxon=96..            975   0.0
gi|5421647   /dataset=dbest /taxon=9606            952   0.0
gi|10895718  /dataset=dbest /taxon=96..            757   0.0
gi|13043102  /dataset=dbest /taxon=960.            714   0.0
gi|519615    /dataset=dbest /taxon=9606 /          531   e-149
gi|11002869  /dataset=dbest /taxon=96..            511   e-143



EXPRESSION INFORMATION FOR MODULATORY USE: 

library source: 

From BLAST dbEST hits: 



gi|10950740  teratocarcinoma
gi|10156485  ovary
gi|5421647   testis
gi|10895718  nervous_normal
gi|13043102  bladder
gi|519615    infant brain
gi|11002869  thyroid gland



From tissue screening panels: 
Fetal whole brain 



FIG.1B 
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1 MVQDCQRNLA RLLLPVKVMR SLDHPNVLKF IGVLYKDKKL NLLTEYIEGG 
51 TLKDFLRSMD PFPWQQKVRF AKGIASGMDK TVVVADFGLS RLIVEERKRA 
101 PMEKATTKKR TLRKNDRKKR YTVVGNPYWM APEMLNGKSY DETVDIFSFG 
151 IVLCEIIGQV YADPDCLPRT LDFGLNVKLF WEKFVPTDCP PAFFPLAAIC 
201 CRLEPESRPA FSKLEDSFEA LSLYLGELGI PLPAELEELD HTVSMQYGLT 
251 RDSPP (SEQ ID NO:2) 



FEATURES : 

Functional domains and key regions: 

[1] PDOC00004 PS00004 CAMP_PHOSPHO_SITE 

cAMP- and cGMP- dependent protein kinase phosphorylation site 

Number of matches: 2 

1 108-111 KKRT 

2 119-122 KRYT 



[2] PDOC00005 PS00005 PKC_PHOSPHO_SITE 

Protein kinase C phosphorylation site 

Number of matches: 4 

1 51-53 TLK 

2 106-108 TTK 

3 107-109 TKK 

4 111-113 TLR 



[3] PDOC00006 PS00006 CK2_PHOSPHO_SITE 
Casein kinase II phosphorylation site 

Number of matches: 4 

1 51-54 TLKD 

2 76-79 SGMD 

3 139-142 SYDE 

4 212-215 SKLE 



[4] PDOC00008 PS00008 MYRISTYL 
N-myristoylation site 

Number of matches: 4 

1 73-78 GIASGM 

FIG.2A 
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2 77-82 GMDKTV 

3 150-155 GIVLCE 

4 158-163 GQVYAD 
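
The motif matches listed in FIG. 2A can be reproduced by scanning the peptide of SEQ ID NO:2 with the corresponding PROSITE patterns. The sketch below is illustrative only and is not the tool used to prepare the figure. It assumes the patterns `[RK](2)-x-[ST]`, `[ST]-x-[RK]`, `[ST]-x(2)-[DE]` and `G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}` for the four sites, rewritten here as Python regular expressions; the helper name `scan` is hypothetical.

```python
import re

# SEQ ID NO:2 as printed in FIG. 2A (255 residues).
SEQ = (
    "MVQDCQRNLARLLLPVKVMRSLDHPNVLKFIGVLYKDKKLNLLTEYIEGG"
    "TLKDFLRSMDPFPWQQKVRFAKGIASGMDKTVVVADFGLSRLIVEERKRA"
    "PMEKATTKKRTLRKNDRKKRYTVVGNPYWMAPEMLNGKSYDETVDIFSFG"
    "IVLCEIIGQVYADPDCLPRTLDFGLNVKLFWEKFVPTDCPPAFFPLAAIC"
    "CRLEPESRPAFSKLEDSFEALSLYLGELGIPLPAELEELDHTVSMQYGLT"
    "RDSPP"
)

# PROSITE-style patterns expressed as regular expressions (assumed equivalents).
PATTERNS = {
    "CAMP_PHOSPHO_SITE": r"[RK]{2}.[ST]",
    "PKC_PHOSPHO_SITE": r"[ST].[RK]",
    "CK2_PHOSPHO_SITE": r"[ST].{2}[DE]",
    "MYRISTYL": r"G[^EDRKHPFYW].{2}[STAGCN][^P]",
}


def scan(seq, regex):
    """Return (start, end, match) tuples, 1-based and inclusive.

    A lookahead is used so that overlapping matches are all reported.
    """
    hits = []
    for m in re.finditer(f"(?=({regex}))", seq):
        start = m.start(1) + 1
        hits.append((start, start + len(m.group(1)) - 1, m.group(1)))
    return hits


results = {name: scan(SEQ, regex) for name, regex in PATTERNS.items()}
for name, hits in results.items():
    print(name, hits)
```

Overlapping matches matter here: the PKC sites at 106-108 and 107-109 share two residues, which a plain non-overlapping search would miss.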

Membrane spanning structure and domains: 
Helix Begin End Score Certainty 

1 142 162 0.872 Putative 

2 184 204 0.652 Putative 

BLAST Alignment to Top Hit: 

>CRA|1000682328847 /altid=gi|8051618 /def=ref|NP_057952.1| LIM 

domain kinase 2 Isoform 2b [Homo sapiens] /org=Homo 

sapiens /taxon=9606 /dataset=nraa /length=617 

Length = 617 

Score = 485 bits (1235), Expect = e-136 

Identities = 241/265 (90%), Positives = 241/265 (90%), Gaps = 22/265 (8%) 

Query: 13  LLPVKVMRSLDHPNVLKFIGVLYKDKKLNLLTEYIEGGTLKDFLRSMDPFPWQQKVRFAK 72
           L  VKVMRSLDHPNVLKFIGVLYKDKKLNLLTEYIEGGTLKDFLRSMDPFPWQQKVRFAK
Sbjct: 353 LTEVKVMRSLDHPNVLKFIGVLYKDKKLNLLTEYIEGGTLKDFLRSMDPFPWQQKVRFAK 412

Query: 73  GIASGM----------------------DKTVVVADFGLSRLIVEERKRAPMEKATTKKR 110
           GIASGM                      DKTVVVADFGLSRLIVEERKRAPMEKATTKKR
Sbjct: 413 GIAS^YWSMCIIHRDUISHNCU^ 472

Query: 111 TLRKNDRKKRYTVVGNPYWMAPEMLNGKSYDETVDIFSFGIVLCEIIGQVYADPDCLPRT 170
           TLRKNDRKKRYTVVGNPYWMAPEMLNGKSYDETVDIFSFGIVLCEIIGQVYADPDCLPRT
Sbjct: 473 TLRKNDRKKRYTVVGNPYWMAPEMLNGKSYDETVDIFSFGIVLCEIIGQVYADPDCLPRT 532

Query: 171 LDFGLNVKLFWEKFVPTDCPPAFFPLAAICCRLEPESRPAFSKLEDSFEALSLYLGELGI 230
           LDFGLNVKLFWEKFVPTDCPPAFFPLAAICCRLEPESRPAFSKLEDSFEALSLYLGELGI
Sbjct: 533 LDFGLNVKLFWEKFVPTDCPPAFFPLAAICCRLEPESRPAFSKLEDSFEALSLYLGELGI 592

Query: 231 PLPAELEELDHTVSMQYGLTRDSPP 255
           PLPAELEELDHTVSMQYGLTRDSPP
Sbjct: 593 PLPAELEELDHTVSMQYGLTRDSPP 617 (SEQ ID NO: 4) 
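
The summary figures printed with the alignment can be cross-checked against the query and subject coordinates. The following Python sketch is illustrative; the function name `alignment_summary` is hypothetical, and the use of integer truncation for the percentages is an assumption that happens to agree with the 90% and 8% shown above.

```python
def alignment_summary(query_range, subject_range, columns, identities):
    """Derive gap and mismatch counts from alignment coordinates.

    query_range and subject_range are 1-based inclusive (start, end) pairs;
    columns is the alignment length reported in the header (the denominator).
    """
    query_residues = query_range[1] - query_range[0] + 1
    subject_residues = subject_range[1] - subject_range[0] + 1
    # Every column not occupied by a residue of one sequence is a gap in that sequence.
    gaps = (columns - query_residues) + (columns - subject_residues)
    aligned_pairs = columns - gaps  # columns holding a residue in both sequences
    return {
        "gaps": gaps,
        "mismatches": aligned_pairs - identities,
        "identity_pct": 100 * identities // columns,  # truncated, by assumption
        "gap_pct": 100 * gaps // columns,
    }


summary = alignment_summary((13, 255), (353, 617), columns=265, identities=241)
print(summary)
```

Under these assumptions the 243 query residues and 265 subject residues account for the 22 gap positions, leaving two mismatched positions among the aligned pairs.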



Hmmer search results (Pfam): 
Model     Description                                      Score    E-value   N
PF00069   Eukaryotic protein kinase domain                 100.1    1.1e-26   2
CE00031   CE00031 VEGFR                                      4.9       0.14   1
CE00204   CE00204 FIBROBLAST_GROWTH_RECEPTOR                 4.7          1   1
CE00359   CE00359 bone_morphogenetic_protein_receptor        1.8        7.9   1
CE00022   CE00022 MAGUK_subfamily_d                          1.5        2.5   1
CE00287   CE00287 PTK_Eph_orphan_receptor                  -48.4    3.8e-05   1
CE00292   CE00292 PTK_membrane_span                        -61.8    2.1e-05   1

FIG.2B 
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CE00291   CE00291 PTK_fgf_receptor                        -113.0      0.027   1
CE00286   CE00286 PTK_EGF_receptor                        -125.1     0.0021   1
CE00290   CE00290 PTK_Trk_family                          -151.3    6.5e-05   1
CE00288   CE00288 PTK_Insulin_receptor                    -210.4      0.014   1



Model     Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
PF00069   1/2        16    79 ..    41   105 ..    52.1  2.3e-13
CE00022   1/1       124   153 ..   187   216 ..     1.5      2.5
PF00069   2/2        81   156 ..   129   182 ..    48.0  3.1e-12
CE00031   1/1       129   156 ..  1114  1141 ..     4.9     0.14
CE00204   1/1       129   156 ..   705   732 ..     4.7        1
CE00359   1/1        79   157 ..   287   356 ..     1.8      7.9
CE00290   1/1         9   218 ..     1   282 []  -151.3  6.5e-05
CE00287   1/1         1   218 [.     1   260 []   -48.4  3.8e-05
CE00291   1/1         1   218 [.     1   285 []  -113.0    0.027
CE00292   1/1         1   218 [.     1   288 []   -61.8  2.1e-05
CE00288   1/1         1   218 [.     1   269 []  -210.4    0.014
CE00286   1/1         6   218 ..     1   263 []  -125.1   0.0021



FIG.2C 
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1 TCATCCTTGC GCAGGGGCCA TGCTAACCTT CTGT GTCTCA GTCCAATTTT 
51 AATGTATGTG CTGCTGAAGC GAGAGTACCA GAGGTTTTTT TGATGGCAGT 
101 GACTTGAACT TATTTAAAAG ATAAGGAGGA GCCAGTGAGG GAGAGGGGTG 
151 CTGTAAAGAT AACTAAAAGT GCACTTCTTC TAAGAAGTAA GATGGAATGG 
201 GATCCAGAAC AGGGGTGTCA TACCGAGTAG CCCAGCCTTT GTTCCGTGGA 
251 CACTGGGGAG TCTAACCCAG AGCTGAGATA GCTTGCAGTG TGGATGAGCC 
301 AGCTGAGTAC AGCAGATAGG GAAAAGAAGC CAAAAATCTG AAGTAGGGCT 
351 GGGGTGAAGG ACAGGGAAGG GCTAGAGAGA CATTTGGAAA GTGAAACCAG 
401 GTGGATATGA GAGGAGAGAG TAGAGGGTCT TGATTTCGGG TCTTTCATGC 
451 TTAACCCAAA GCAGGTACTA AAGTATGTGT TGATTGAATG TCTTTGGGTT 
501 TCTCAAGACT GGAGAAAGCA GGGCAAGCTC TGGAGGGTAT GGCAATAACA 
551 AGTTATCTTG AATATCCTCA TGGTGGAAAG TCCTGATCCT GTTTGAATTT 
601 TGGAAATAGA AATCATTCAG AGCCAAGAGA TTGAATTGTT GAGTAAGTGG 
651 GTGGTCAGGT TACAGACTTA ATTTTGGGTT AAAAAGTAAA AACAAGAAAC 
701 AAGGTGTGGC TCTAAAATAA TGAGATGTGC TGGGGGTGGG GCATGGCAGC 
751 TCATAAACTG ACCCTGAAAG CTCTTACATG TAAGAGTTCC AAAAATATTT 
801 CCAAAACTTG GAAGATTCAT TTGGATGTTT GTGTTCATTA AAATCTCTCA 
851 CTAATTCATT GTCTTGTCCA CTGTCCGTAA CCCAACCTGG GATTGGTTTG 
901 AGTGAGTCTC TCAGACTTTC TGCCTTGGAG TTTGTGAGAG AGATGGCATA 
951 CTCTGTGACC ACTGTCACCC TAAAACCAAA AAGGCCCCTC TTGACAAGGA 
1001 GTCTGAGGAT TTTAGACCCA GGAAGAATGA GTGATGGGCA TATATATATC 
1051 CTATTACTGA GGCATGAGAA GAGTGGAATG GGTGGGTTGA GGTGGTGTTT 
1101 TAAGGCCTCT TGCCAGCTTG TTTAACTCTT CTCTGGGGAA CGAGGGGGAC 
1151 AACTGTGTAC ATTGGCTGCT CCAGAATGAT GTTGAGCAAT CTTGAAGTGC 
1201 CAGGAGCTGT GCTTTGTCTA TTCATGGCCC CTGTGCCTGT GAAACAGGGT 
1251 TCGGTGACTG TCAGTGTGCC TGTGGCAGTC TGTAGTTACC CAGAGAGAAC 

1301 AAAGCTGCAT ACACAGAGCG CACAAGGGAG TCTTGTAACA ACCTTGTCCT 
1351 GCTTTCTAGG GCTGAGTCAG GTACCACAGC TTGATCTCAG CTGTCCTCTT 
1401 TATTTCAAGA AGTTGACATC TGAGCCATAC CAGGAGTATT GTATTTTGTT 
1451 TGAGGCCTCT CTTTTTGGAG GAACATGGAC CGACTCTGTG CTTTTGTCTA 
1501 TGCTGGTCTC TGAGCTCACA CAACCCTTCA CCCTCCTTTC TCAGCCAGTG 
1551 ATAGGTAAGT CTTCCCTATC TTGCAAGGCT CAGCTCAAGT GTCAGCTTCC 
1601 TCTACAAAGA CTTTCCTGGT TCCCCTCATT GGAGTGAACA AGAGTTGACA 
1651 TGGTAGAATG GAAAGAGCAG AAGCTTTAGA ATGAGCCAGA CCTGAGTATG 
1701 AATGCTAGAT CCACCACTTA GCTAGTCAAC CCTGCCCCCT GCCTCAAGTT 
1751 TTAATTTTCC TATCCATTAA GTGAATATAA TAATACCTGT GTCACAGGAT 
1801 TATTTTGAGA ATTAAATGAG ATTAGGTCTA TGAAAGCACC TAGCAGAGTT 
1851 CTTGGCATAT AGGAGGCATT CATTAAATAT TTGTTCTTCC CCTTTTATAC 
1901 CCATTACTTT TCTTTTTCTG AACTAAAATA ATACTTGGTT CTATCTCTGA 
1951 AATAACATCC AAGTGAAAAA TCAACAACAT GAAAGAGCAG TTCTTTTCCA 
2001 GTGGATTTGC TTCTTAAGGA GCAGAGATTA TGTAATCTAA CAGCCTCCAA 
2051 CATACAAAGA GCTTTGTATC TAGAACAGGG GTCCCCAGCC CCTGGACCGC 

2101 CAACTGGTAC GGGTCTGTAG CCTGTTAGGA ACCAGGCTGC ACAGCAGGAG 

2151 GTGAGCGGCG GGCCAGTGAG CATTGCTGCC TGAGCTCTGC CTCCTGTCAG 

2201 ATCAGTGGTG GCATTAGATT CTCATAGGAG TGTGAACCCT ATTGTGAACT 

2251 GCACATGCAA GGGATCTGGG TTGCATGCTC CTTATGAGAA TCTCACTAAT 
2301 GGCTGATGAT CTGAGTTGGA ACAGTTTGAT ACCAAAACCA TCCCCCCGCC 
2351 CCCCAACCCC CAGCCTAGGG TCCGTGGAAA AATTGGCCCC TGGTGCCAAA 
2401 AAGGTTGAGG ACTGCTGATC TAGAGGACCA ATTTATTCAA TGTTGGTTGA 
2451 GTAAATGAGC TCTTGGATTA GGTGATGGAA AAATCTGAAA AAACAGGGCT 
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2501 TTTGAGGAAT AGGAAAAGGC AGTAACATGT TTAACCCAGA GAGAAGTTTC 
2551 TGGCTGTTGG CTGGGAATAG TCATAGGAAG GGCTGACACT GAAAAGAAGG 
2601 AGATTGTGTT CGTTTCTTCT TCTCAGAGCT ATAAGCAAAG GCTGAAAGTT 
2651 CTAGAAAAAG GCAAGTTTTG TTTCAGTAGA AAAAAGGATA ATCAGAACCA 
2701 TTTTTAGAAA ATGGAATGAG ACTACTTTTG AGGCCATGAG TTCCTTGTCC 

2751 CTGGAGAGAT GAGCAGAGGT TGGACAAGTG CTTACCAGAG ATCTTGTGGA 
2801 GGCAGAAACT GTGCATCTAG CAGAGCATTG GCCTAACCCT TTCAAATGAG 
2851 ATGCTGTTAA CTCAGTCTTA TTCTACATGG TAGGAATCCT GTCCCTTTGC 
2901 CTCCTGCTAC TTTGGGCCTC TCAACCTCTT GGTTTTGTGT GCAGGTGAAG 
2951 ATGTCTGGAG GTGTCCAGGC TGTGGGGACC ACATTGCTCC AAGCCAGATA 
3001 TGGTACAGGA CTGTCAACGA AACCTGGCAC GGCTCTTGCT TCCGGTAGGT 
3051 GGGCCTATCC TCCCATCTTT ACCAGTGTAC TATGGGCCAA GCACTATTTC 
3101 ATGTTCTGAT GGAAAACACA GAAACAAGCT TCTGAGTTGA GAATTTCAAT 
3151 CTTAGGGTGG GGAAAGGAAT GTACCAAGGA AGAGCTCATG ACCAAACCTC 

3201 AAGTGTGGCC CCCCTGAACC CAGGTTAAAT TGGAAGAGCC ATAAATGGGC 

3251 CAGCTGGAGG CAGGGTGGGG GGATGAGAGG AGCCCTTTCC AGGGTTGTCC 

3301 CATATCCCTC ACTTTATGGG TGAGGAAACT GAGGCCCAGG AAGAGTGACT 

3351 TTCCTGTGGC TGCACTACAG ATTATGCAGG TACTTCAAGA GTTGTTTGTA 

3401 TTCTTATTTT ATTTTATTTT ATTTTATTTT ATTTTATTTT ATTTTATGAG 
3451 AGGGATTCTT GCTGTTGCCC AGGCTGGAGT GCAGTGGTGC AATCTCGGCT 
3501 CACTGCAATC TCTGCCTGCT GGGTTCAAGT GATTTTTCTG CCTTAGCTTC 
3551 CTGAGTAGCT GAGATGACAG GCACCTGCCA CCATGCGCAG CTAATTTTTG 
3601 TATTTTAGTG GAGACGGGGG TTTCAACATG TTGGTCAGGC TGGTCTTGAA 
3651 CTCCTGACCT CAAATGATGC ACCCACCTCG ACCTCCCAAA GTGCTGGAAT 
3701 TACAGGCGTG AACCACTGTG CCCAGCCAAG AGTTGTTTTT AGTGTGGTTG 
3751 GCAGAGCCAG CTCTTCCTTC ACCACAGGAT GCCTCCCTAG GTTCCTACTT 
3801 TTTGTTACTA GCTTTTATTA TAGCTATATT ATTATTATTA TTATTATTAT 
3851 TATTATTATT ATTATTGAGA CAGAGTCTCG CTCTGTCGCC CAGGCTGGTG 
3901 TACAGTGGTG CGATCCCGGG CTCACTGCAA CCTCTGCCTC CCGAGTTCAA 
3951 GCAGTTCTCC TGCCTCAGCC CCCCGAGTAG GTGGGACTAC AGGCGCCTGC 
4001 CACCACACCC GGCTAATTTT TGTATTTTTA GTAGAGACGG GGTTTCACCT 
4051 TGTTGACCAG GCTGGTCTGG AGCTCCTGAC CTC AGGTMG TGCTAGAATC 
4101 ACAGGCGTGA ACCACTGCGC CCAGCCAAGA GTTGTTTTTA GTGTGGTTGG 
4151 CAGAGCCAGC TCTTCCTCAC CACAGGTTGC CTCCCTAGGT TCCTACTTTT 
4201 TGTTACTAGC TTTATTATAG CTACATTATT ATTATTATTG TTATTATTAT 
4251 TGAGACAGAG TCTCGCTCTG TCGCCCAGGC TGGTGTACAG TGATGTGATC 
4301 TTGGCTCACT GCAACCTCTG CCCCCCGAGT TCAAGCAATT CTCCTGCTTC 
4351 AGCCCCCCTA GTAGGTGGGA CTCCAGGCAC CTGCCACCAC GCCCAGCTAA 
4401 TTTTTGTATT TTTAGTAGAG GCGGGGTTTC ACCTTGTTGG CCAGGCTGGT 
4451 CTCAAACTCC TGACCTCAGG TGATCCGCCT GCCTCGGCCT CCCAAAATGT 
4501 TGGGATTACA GGCATGAGCC ACCGCGCCCT GCCTATAGCT ACATTATTTT 
4551 TGTAGGCAGC TCAGTTTCTT AAAAATTATA CAGACTTCAA ATCAGATTTG 
4601 TTCCTGCTGT CTGAGGCTCA GTTTCTTCAT CTGGAAAATG GATGGTAATA 
4651 ATCTTGTTGA GATTGAATGA AATAATATAT GCAGTGTATC CAGTACATGG 
4701 TAGACACCCA GTGAATGGTT ATTCCTTCCT CCCATCGGAT TGGAATTCTC 
4751 AAGGGTGGGA ACTTGTCTTT ATATTCTTCA CAACGTAAAA TAGTTGAAAT 
4801 TTGTTGGTGG AAAGAAGAGC AGTCCACTCC AGAGGCTGGA TGGGCATGCC 
4851 TGGCCCCCAA GGTCTGAAGT GGTAGGGCTG TGCCTATATC CTGAGAATGA 
4901 GATAGACTAG GCAGGCACCT TGTGCTGTAG ATTCCAGCTC CTGCACATAG 
4951 CTCTTGTTGT AAAACATCCC TGTGCTTATA CCAAGTAATT GAGTTGACCT 
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5001 TTAAACACTT GCCTCTTCCC TGGGAACCAT ATAGGGGATT GGCCTGGAGA 
5051 CGTCTGGCCT CTGGAAGAGT TGGAAAGCAG CCATCATTAT TATCCTTTCC 
5101 TTTCAGCTAT AACTCAGAGC TCTCAAGTCT TTTCTGTGGA TCTTATTGCC 
5151 TTGGTTCTTG CCCCTTTTAC TCCCAGGGM GTTGATTCTG TCTTTTCTGT 
5201 TCCATTTAGT ATGACAGGAG CAGAGAATGT CAGAGCTGTA AGGGACCTTA 
5251 TAGTTAAAGC CTTTGGCTGG TCCTTTCATT TTATAGCTGG GACTAATAAG 
5301 TAACGTCAAA ACCCAATGAG TTCACAGATT GGGTCTCGCC TTGGCATGTA 
5351 ACCCATATGT TCATATTCTT GCTGTTTTCC TATGTGTATG AATATTTTCT 
5401 ATCCAAAATA AGCAGGACAG GGTAGAGCAA GTTAATCTTT GGAATTTCTG 
5451 GATTCTCTTA GAGCTAAAAA ACTTCAGAAC TAGAAGAAAC CACCCACTAT 
5501 ATGGTATAAC CCATTCATAT CACAGATGAG GCCTGAAACC AAAAAGACTT 
5551 GCTCAGGCCA TGGATGACAA GAGCTGGCCC TAGCACTGAA CTCTTGGGTC 
5601 ATTTGTAGGT CTAGTCAGAT GCTAGCTTGT TAGCTCTGTG CGTGCGTGTG 
5651 TGTGTGTGTG TGTGTGTGTG TGTGTGAGAT AGAGACAGAA AGATAACATA 
5701 TGTACACAAA TACATAAAGA GGAAGTAGAC ACGTTAGCAT GGTAGATAAG 
5751 AGTACAGGCA GGCCAGGCGT GGTGGCTCAC GCCTGTAATC CCAGCACTTT 
5801 GGGAGGCCAA GGCAGGTGGA TCACCTGAGG TCAGGAATTC GAGACCAGCC 
5851 TGACCAACAT GGTGAAACCC CATCTCTACT AAATACAGAA AAAAATTAGC 
5901 TTGGCATGGT GGCACATGCC TGTAATCCCA GCTACTTGGG AAGCTGAAGC 
5951 AGGAGAATCG CTTGAATCCG GGAAGCAGAA GTTGCAGTGA GCCGAGATTG 
6001 TGCCATTACA GTCTAGCCTG GGCAACAAGA GGGAAACTCC ATCGCAAAAA 
6051 AACAACCACC ACCAAGAGTA CAGGCTATGG AATGAGACTA TGGTTTTAAA 
6101 TCCTGGCTTT GCAATTTATT AACTAGCCTT AAGTGACTTC CCTGAGCTTC 
6151 AGGCACCAAT CTGTAAAATG AGGATAAGAA TATTACTCAT GCCACATGGT 
6201 TGTTAGGGAG GATTAAATGT GATAACCTAT ATAAAGTGGC TAGCATAGCA 
6251 TCTGACATAT AGAAAACTCT TAATAGGGCC GGACGTGGTG GCTTATGCCT 
6301 GTAATCCTAG CACTCTGGGA GGCCGAGGCA GAAGGATCGC TTGAGCCCAT 
6351 GAGCCCAGGA GTTTGAGACC AGCCTGGCCA ACATGGCAAA ACTCCACCTC 
6401 TACAAAAAAT ACAAAAATAT TAGCCAGGCG TGATGGCACA CACCTGTAGT 
6451 CCCAGCTACT TGGGAAGCTG AGGAGCGATG ATTACCTGAG CCCAGGGATA 
6501 TCAAGGCTGT AGTGAGCTGT GATCATGCCA CTGTACTCCA TCCAGCTGGG 
6551 GGACAGAGTG AAACCCCTGT CTCAAAACAA AACAAATGAA AAAAAAAACC 
6601 CTTAATAATC AGTAACTGTC ACTTTATATT ATGTTGTGAG TGTGTGTCTA 
6651 TATACACCTA TATGTATACA TTTCTCTTAT TACACATTCA TTGGTGATCT 
6701 GATGTGGAGC CCCAGGGATT AAGGGCAACT TTGAACTACC CTGACACAAT 
6751 CAAGCCAAAT ATCATTCCCG TGGAGGAAGT AGAGTATCTA GGTTCTGTCT 
6801 CCTAGTTGCA GCTTTACCTT GAGGACAGAG ACTCTAATCC AGCTGTGCTG 
6851 AAGGAGCACA TCTCCTGACT TCTGAGCTTT CCCCTGGTAA ATTCAAACTG 
6901 GATGTCACGG CGCCCTCAGA TAGAGCCTGG TAATTTGCCC TGGGGAGAGT 
6951 GACTGTCTTT TGGATCTAAT TTGACTTTTG CCCCAGTTGG AGGAAAATCT 
7001 TCAGGGCTAG GAAGGATTGT ATTTGTCTGA CCCCAGAGAT AACCTGGGTT 
7051 TTGAGGAACA TGGGGCATCA ACCTGAATGG TCTTGTAAGA TCTCTCCCAC 
7101 GCCAGCTTGC CAGTGTTTCT CTGATGAATT TAGAGTACCT GAGTAGTGCA 
7151 GGCCTGCTGG GAGGAGGACT CTCCCTCTGT GCTACTCAGA GAAATTCATT 
7201 CTTCAAGGCC CCCTTCCAGC CTTGCTCTTA CCCAGCTGGG CTACAGTTAC 
7251 AATAAAGGAA ATGACTTTTC TTCTCCCCTT CCCCCAGTAC CTTTGTTTTC 
7301 CTAGTCACAG GGTGGGGCTG GATATTGAAT GGAGAAATTG CTGGGGTCCA 
7351 TCCTAAACTC CTCCCCTCAT CTCTCCCTTA CATTACCCCA TTCTTCTGTC 
7401 TGCAGCCACA TCCATAATCC TGCCTCTGTT AGCCTTCCGA CAGACCCTCA 
7451 GGTGCCCAGG ACAACAGGAA GCTACTTAAA GCTGGAACCT CAGACTGTGC 
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7501 AATGGAGGCC AGTGACAAAA CTGAAAGTAG CTCTGTCAGT AATTGTGCTG 
7551 GTGCGATTAG GCAGCTGGCC AGAATCTTTT GGATCTCCTG GACATATGGC 
7601 TGACTAGTCC TCCCAAGCCT TCCCAACAGG CCTCTTTTTT TTCCTTTTTT 
7651 TCTTTTCTTT TTTTTCTTTC TTTCTTTCTT TCI 1 1 1 U 1 1 1 1 1 1 1 1 1 IAS 
7701 GGTAGTGAAG TGAAATTGTG GGAGTGGAAA AGGAACAAAG AAATCGGTAA 
7751 CTGGTAGTGA TCAATTACTT GTAAACACTA TTGTACTTGG ACCAGCCCAG 
7801 TAGGCCTTTT TTAAAACTCT GAGTTACCTC TCTTTCCTTT CCTTGAGCAG 
7851 TGCCATTAAT TCTGTATCTG GGGCAATCCT TTCTGATGTT CTCTGGACCT 
7901 GGCTCTCTCT CCTTAGGAGA GGCCAGGAGA GTAGCCAGAG AGCATGTCAT 
7951 TTGTAGCTGA GGTTAAAGTG TGGAGCTATC AATGGTGACC TGGCCTCTTG 
8001 GCATGTTAGC AAGCCAGAGG ACCTTGACAA CTTTTTTGAT GATTGTCCGT 
8051 TCACCCTGAT CAAAGGTGTT TGGCTTAGGA GGAGGGAAGA AAAGCTACCC 
8101 CTATTAGTCT TGATGGCCCC AGCGTGGGTC TCTATTGCTT GACCTGGTTC 
8151 CTAGCAGCAT TATCAGAAGG AAAATCCACC GCTCTTAAGG CTCCTGGGAA 
8201 CTTTCAGGAC TTCCTTTCTC AGGATTGCAA ACATAAGACT ATTTGAGCTT 
8251 TCACTTTTGA AAAGCGGTTA CTAATACCTA TACTCTGGGA AAGGGCTAAT 
8301 GCAGATAGAA GACTGTGGTC ACTGCATCAG GCAACAGACC ATTTCCGCTA 
8351 AATTTAGTGA CTCCAGGAAG GCCAGTGAAG AAATAACACA CGTAGCAACC 
8401 AGAGACTGTG TTGTAATATG TTGGCTGACA GCAGGGTACT TTCTGTGATG 
8451 CTGAAAGCCA CATTCATTTT CTCTCCCCTC ATCCCCATCT AAGCAAGCCT 
8501 GGTAGAATCA TAATTACAGT AATAGGTACC ACnATTGAG TACTCTGTGC 
8551 CAGACACCCT CCTGAGCATA CGACATGCAT AGCACATTTA ATCCTTACAA 
8601 TGACTTAATA AAATGTAGTA CTAGTCTTAC CTACTTCGAG AATAGGGAAA 
8651 TGGAGGTTAC TTGTTTAAAG TCACAGAGCT AATAGGTAGC ATAGCTGAGA 
8701 TTTGAACTCA GGCATTCTTA CTCCTTGCCT GCAAGAGTCT CTTGGCATTC 
8751 TTGAATGCAA GCATATTTCT TAACCTCACT GAGGCTCAGT TTCCTCTTAT 
8801 ATAATATGGG GTAAAGAGCC CTCACCCTGC CTGCCACACA CTGGTAGTGT 
8851 CAGATAACAT TGAAGGGTGT TAGTTTAAAG GCTTCATGGA CTCTATAATG 
8901 TCAACAAAAG TGCTGTTAAC TTTCTTCTGG GTCTCAGGCT CCTGATGTAG 
8951 AGTCAGTGGA GCAACCCTGC CATCTGCTGT TATGCTGTTG ATGTTGCTGC 
9001 CACACTTACT AACCTAAACC TTTGATTCTG GCTGTGGCCT TCTCCAGAAG 
9051 GTGTTTACTC ATTTGTCCAG TTTATCTTTT AGGAAACAGC CAGCCCGTAG 
9101 ATCATTAAGG CTGGCTATTG GACAGGGGGC TGGGGCCTGC CTGACAGAGG 
9151 AAGGAAGGGC AGACATCTGG TTCTTCCTCT GCCCCTACAA GAGACTCCAG 
9201 CCTGACCACA GAGTGGTACT CCTAGGATGT AGCAGCAGCA TATGAGCTTG 
9251 AATGTGCCTT AATCCTGCTC TTTACTTTGA GAAGAGAGAA CTAAGGACCC 
9301 ACAGATGTTT CACAGCTTCT ATAGGAGGCA GAGGTAGAAA AATGGAGAGA 
9351 GATGAGGCCA GAGATAGATA ACTGATATTA ATTAAACGTT GTATTAAGAA 
9401 CCTCACTTAG ATTATCTGAT TCAATCTTCA TAATAACCCT GCAACCCCCA 
9451 CCII M I N G AGAACAGGGT CTTGCTCTGT TGTCCAGGCT ACAGTGCACT 
9501 GGTACAATCA TAGTTCACTG CAGTGTCAAC CTCCTGAGCT CAAGCAATCC 
9551 TCCCACCTCA GCCTTGCAAG CAGCTTGGAC TACAGGCGTG CCACCACACC 
9601 TTGCCATTTT TTTTTATTTT AAGTAGAAAC AAGGTCTTAT TAATACTATG 
9651 TTGCCCAGGC TGGTCTTGAA CTCCAGCGAT CCTCCTGCCC CAGCCTCCCA 
9701 AAGTGCTTGG GATTACGGAA GTAAGCCACT GTGCCTGGCC AGTGCAACCC 
9751 CCATTTTATA CTAAAACAGG AAGGCCCAGA AAGGTTTGGA GTAACTTGTC 
9801 CAGGGTCACA CAGATGATAT TTGAACTCAG GTCTCCCTGG CTCCCAAGAG 
9851 AGTCTGCTTT CCACTAGGAC TCCCAGGAGA AAAAAAAAAA AAAAAACAGT 
9901 AGACTTGGAG ACAGAAAATC TGATTTGAGT CTTAGTTGAG CTAGGCTAAC 
9951 TGTGTAACTG TGGGCAAGTT CCTTAGCCCC TGTGAGCCTC AGTTTCTTAT 
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10001 CTGTAAAATG TCATAAAAGA AATCCATCTC ATGGAGTAGT TGTGATGATC 
10051 AAGGACTCTG AAAACATTAG AATGGTTTAA TGTGAAGGAT TAGCAGCAGC 
10101 ACATGGCAAC ATTGTGCATC TTATATTAAC TATCCAAATA TATCAAGCGT 
10151 CATTTGCTAT ATATAAAAGT CATCAAATTA GGCACTGTGG GGGATACGGA 
10201 GTTGGCATAC TAGCCTGGCC TCTTAATTAA TTCATTAATT AGCTTATTTA 
10251 TTTTTGAGAT AGGTCTTGCT CTATTGCCCA GGCTGGAGTG CAGTGGCATG 
10301 ATGATAGCTT ACTATAGCCT CAATCTCCCA GGCTTAAACA ATCCTCCTGA 
10351 GTAGCTGGGA CTACAGGCAC ACACTACCAT GCCCAGCTAA TTTTTTTTTA 
10401 ATTTTTTGTA GAGACAGGGT CTTGCTCTGT TGCCCAGGCT GGTCTCAAAC 
10451 TCCTGGGCTC GAGATCCTCC CACCTGGGCC TCACAAAGTG TTGGGATTAC 
10501 AGGTATGAGC CACGGCACCT GGCCTGGTCT CTTAACTGGT TCCCTAAGAC 
10551 AGCTGGAAAT AGAGAATGTC ATGGAGCATT CCTAACCATG GGCTCCAGCC 
10601 TGGCTTTCAT TCTGTTTCTC CCCTGAAACA ACATTCCTTT AGTAATATTC 
10651 CGAATAACAG CTTCATCAGT CTGTCTACCG ACCACTCTTC AGGCTTCATC 
10701 TTATATGACC TCCCAAACTG CACTAAGGGT TGTATTAGAG AAAAGTGGAT 
10751 AAAGTTCGGA GTCAGGCTGC TTGAGCTTAA ATGCCAGCTT CACTTACCAG 
10801 CCACCTGACC ATGAGTCAGC TGCTTAACCA TTCTTTGCCA CAGTTTCCTT 
10851 GTCTATGAAA AGGGAAATGG CTCCCACCTC AAAAAGTTGT TAACATTAAA 
10901 TTCAATCATG TATTCAAAGT CCTGAGCAGA ATGTCTGGCC ATGACTGGGA 
10951 CTTAACAGAT GTTAGCATTT ATTATTAGTA TCTGTCAGTC TTGAMTGTT 
11001 CTCTTCCCTT GGCTTTCATG ACATTCCACA CTCTCCTGGT TTTCTCTTAC 
11051 CTCTCTGGTA ATACCTGTTT GCTTATCCTT CTTTGTCCAG CTCTGGGATG 
moi ttacSttcc TiS«DBre CTirrrrcTC cttaggcagt cttacacaca 
11151 CTCATGACF CCnCCAHG TCCTCCACAC ACTGATGACC CTAAAATCAG 

11201 TATCTCCAGC CTAAACCTTT CCACTGAGTT CTAGACCCAT ATGTTGTACT 

11251 ATCAACCTGG CTTGTCCATT TGAATGTCTT CCAGGCACTT CAGACTCTCT 
11301 TCTCTAGACT TTGCTGGACT TTCACTCTTC CCCCTAAAAC TGGCTCCTCT 
11351 TCCACTGAAA CATGTATGTC ATTGAGAGGC ACCACCATCC ACCCAGTGCC 
11401 TAAGCCAGAA ACCTAGGAAT CCTTGATACC TGTTCTCTCT CATCCTGCAT 
11451 ATCCAAGCCT ATCAGTTTTA TCTCTAAATT ATATTTTGGT AGGTTTACTT 
11501 CTTTCCTTTT CTCCCACCAC CACCCTGCTC CAAGCTACCA TCATCTCACC 
11551 TGGATGTCTG CAATAGCCTC ATCTCCCACA GCCACTCTGC ACCCCCTAAT 
11601 CTGTTCTCTA TAGAGCAGTT GGAAGGAGTG ATTTTTGTTG TTTGTTTTGT 
11651 TTTGTTTTAG ACAGAGTCTC ACTCTGTTCC CCAAGGCTGG AGTGCAGTGG 
11701 CACAATTTCG GCTCACTGCA ACTTCTGCCT CCCGGGTTTA AGCAATTCTC 
11751 CTGCCTCAGC CTCCCAAGTA GCTGGGATTA AGGCACCGGC CCCCATACCC 
11801 AGCTAATTTT TATATTTTTA GTAGAGATGG GGTTTTGCCA TGTTGGCCAA 
11851 GCTAGTCTCG AACTCCTGAC CTCAAGTGAT CCACCTGCCT CGGCCTCCCA 
11901 AAGTGCTGGG ATTACAGGTG TGAGCCACTG CACCTGGCTG GAAGGAGTGA 
11951 TCTTAAAAM AAAAAAAACA AAAAAAAACT TGACTGTGTC ACTCTGTGTT 
12001 GTCTCTCCTA CCTTGTATAC TTCCACAACT TCCCAGTGTT CTTGGATAAA 
12051 GACCAAAATC CTTAACTTGG CCAGGCGCGG TGGCTCACAC CTATCATCTC 
12101 AGCACTTTGG GAGGCCGAGG CAGGCAGATC ATGAAGTCAA GASATTGAGA 
12151 CCATCCTGGC CAACATGGTG AAACCCCATC TCTACTAAAA JTACAAAAAT 
12201 TAGCTGGTCG TGGTGGCGTG TGCCTGTAGT CCCAGCTACT TGGGAGGCTG 
12251 AGGCAGGAGA ATCACTTGAA CCTGGGAGGC AGAGGTTGCA GTGAGCCCAG 
12301 ATCACGCCAC TGCACTCCAG CCTGGTGACA GAGTAAGACT CCATCTCAAA 
12351 AAAAAAAAAA AAAAAAAAAA TTCCTTAATT TGGCCTACAG TAGAGCCCTC 
12401 CGTAATGTGG CCTCTCTCCA CATCTCCACA ACCTCCTGCT CCCTGCACTT 
12451 CAGCCTCACC TCTCTTCTGG ACAGGCCCTC CTTCTGACAA GGGCTTTGTT 
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12501 CATTCTGCTC CCTCTGCCTA GAATGCCCCC TTACTCTGTT CACTTAACTC 
12551 CTGCTTATCG TTTAGATCTT TACCTGGATG GCTCAGAGAA ATATAGAAGT 
12601 AATTCCTCAC CCTGAAAAAT AGGTTAGGTC CCTGTTTTAT GTTTTCATAG 
12651 ACCTTTCCTT TGAGGCTTTT TTTAAAAAAG TAGTTTTAAT CTCACATTTA 
12701 TTCATGTGAT CATCTCCTTA ATGATATCTT AAGACCTCTA ATAGAACAAT 
12751 TTGGTCATGG ACTGTGGGGT TTTTGCCCCT CATTGTGTCA GCACTGAGCA 
12801 TATTGTTGGC ATAGGAGGGA TATTTGTTGA ATGAATTGCT AGAGGTGGCC 
12851 AAGAGATATG ATGTAAGTCA GGCTTTTCCC TGCCCTTCCC CTTCCCCTTC 
12901 CCCACATCCT TCCTATAGCA GCCACCGTGG CTGCAGTTAC TGTAAATGGC 
12951 AAGACGGAAT CAGTTCCGGA CATTGGGTTG TTTTAGAAAA TTGCCTGCAA 
13001 GTGTCAGGGT GATAAGTTM AGCTTTGTCT TTTGCCCTCA GAGGAGCTAT 
13051 CCCATAGTGA GTAGAAGCCA GAGAAGCTGA CCCCAGGAGT CCTTCTTTCC 
13101 AGCAGCAGGT CTTGAGCTGC ACTTCTCTGT AGCTACAATC CAGGCAGGAA 
13151 CAAGCCCTAG GTACCTCCGG AGAGGAGGGC AAGAGAGGAA GAATGAGTTC 
13201 AGCTACTCTA GCCACCAAAC TGATTATGAA TTGCCCTGAA ATCTGAAAAA 
13251 TTTCAATTCC AATCGTAAGT TTGTTTTGTT TCATTTTGTT TTCTTAAATT 
13301 GTATATTTGA AAGATGGCAT TAACTAAAGA TATATATTCA ATATAGAGTG 
13351 GAAAAAATGG AATACTTGCA TAGTATCTTT TACTTATAGG TGATTTATGA 
13401 TGGGGAGTGG GGTGGATAGG TTGGCAGTTC CCCCAAGAAG TTGGAAATGA 
13451 AGTTTGTCCT CTGTGAGTTG AACTAATTAG ATCCACAAGT AATGAAAGCA 
13501 GTATTGTGTT GTAGTTAAGA GCACACTCTA GAACCAGATT GCTTAGTTTC 
13551 AAATCCTGGT TCTGCCTTTT ATTATCTGTG TACTTTGGGC AAGTTACTTG 
13601 CCCTTTGTGT GCTTCATTTT TCTCATCTAG AAAATGGAGA GGCCAGGCGT 
13651 AGTGGCTCAT GCCTATAATC CCAGCACTTT GGGAGGCCGA GGCGGGCAGA 
13701 TCACCTGAGG TGAGAAGTTC AAGACCAGCC TGGCCAACAT GGTGAAACCC 
13751 TGTCTCTACA AAAATACAAA AATTAGCCAG GCATGATGGC GGGTGCCTGT 
13801 AATCCCAGCT ACCCAGGAGC CTGAGGCGGG AGAAACACTT GAACCTGGAA 
13851 GGCAGAGGTT GTAGTGAGCC AGGATTGCAC CACTGCACTC CAGCCTGGGT 
13901 GACAAGAGCT AGACTCAGTC TAAAAAAAAA AAAAAAAAAC AAACTGGAGA 
13951 TACAGGCTGG GTGCAGGGCT TACACTTATA ATATCAGCAC TTTGGGAGGC 
14001 CTAGGCGGGA GGATTGCTTG AACTCAGGAG TTTCAAGATC AGTCTGGGTA 
14051 ACAGAGCAAG ACCTCATCCC CACAAAAAAT CAAAAATTTA GCCAGGCATG 
14101 GTGGCTCATG CCTGTGGTCC CAGCTACTCA GGAGGCTGAG GCGAGAGGAT 
14151 TGCTTGAGCC CAGGAGGTTG AGGCTGCAGT GAACCATGAC TGCACCACTA 
14201 CATGCCAGCC TGGATGACAG AGCAAGACCC TATCTCAAAA AAAAAAAAAA 
14251 AAAGAAACGA GCCAGGCGCG TTTGCTCACG CCAGTAATCC CAGCACTTTG 
14301 GGAGGCCAAG GCAGGTGGAT CACTTGAGGT CAGGAGATCG AGACTAGCCT 
14351 GGCCAACATG GTGAAACCCC ATCTCAACTG AAAATACAAA AATTAGCCAG 
14401 GCATGGTGGC ATGCTCCTGT AGTCCCAGCT ACTCACTTGG AGGCTGAGGC 
14451 ACGAGAATCG CTTGAACCCA GGAGGCGGAG GTTGCAGTGG GCCAACATCA 
14501 TGTCACTGCA CTCCAGCCTG GGAGACAGAG CGAGACTCTG TCTCAATAAA 
14551 TAAATAAACA TAAAATAAAA TAAAATAAAA TAAAATAAAA TAAAAAAATA 
14601 TGGAGGCCAG CAGGCACGGT GGCTCACGCA TGTAATCCCA GCACTTTGGG 
14651 AGGCCGAGGG GGGCGGATCA CAAGGTCAGG AGATCGAGAC CATCCTGGCT 
14701 AACACAGTGA AACCGCGTCT CTACTAAAAA TACACAAAAT TAGCCAGGCA 
14751 TGGTGGCAGG CACCTGTAGT CCCTGCTACT CAGGAGGCTG AGGCAGGAGA 
14801 ATGGCGTGAA CCCGGGAGGC GGAGCTTGCA GTGAGCTGAG ATCGCGCCAC 
14851 TGCAGTCCAG CCTGGGCGAC AGAGCAAGAC TCTGTCTCAA AAAAAAAAAA 
14901 AAAAATGGAG GTTGGGCGCG GTGGCTCGCG CCTGTAATCC CAGCACTTTG 
14951 GGAGGTCGAG GCGGGCGGAT CACCTGAGGT CAGGAGTTCC AGACCAGCCT 
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15001 GGCCAACATG GTGAAACCTT GTCTCTACTA AAATTACAAA AATTAGCCAG 
15051 GCACGATGGC AGGCACCTGT AATCCCAGCT ACTTAGGAGA CTAAGGCAGG 
15101 AGAATAGCTT GAACCTGGGA GATGGAGGTT GCAGTGTGCT GAGATCGCGC 
15151 CACTGCCCTC CAGTAGAGTG AGA1TCCGTC TCAAAAAAAA AAAAAAAGAA 
15201 GAAATGGAGA TACAAACTTA CTACCTACCT CCTTACAACC TACCCTCACA 
15251 GTATTACTGT GAATAAAAGT GTGTGTAGCA CTGGGAACAC TATTCACAGA 
15301 GCACTCATGA ATGTTTGTTC TTTGTTATTA GTTACTAGAG AGGCAAATGT 
15351 CTGCCAGGGC TGAATAATAT GTGTGAATTG GTGATTGTCG CACATATCTA 
15401 AAGAAGTAGT TAN 1 1 NIC AATTAAAACT TAGTTTAAAA ACCAATATAA 
15451 GGCCGAGCGC AGTGGCTCAC ACCTGTAATC CCAGCACTTT GGGAGGCCGA 
15501 GGTGGGCAGA TCATTTGAGG TCAGGAGTTC GAGACTAGCC TGGCCAACAT 
15551 GGTGAAACCC TGTCTCTGCT AAAAAAAAAA AAAAAGTACA AAAATTAGCC 
15601 AGGCATGATG GCAGGTCCCT GTAATCCCAG CTACTTGGGA GGCCGAGGCA 
15651 GGAGAATTGC TTGAACCCAG GAGGTGGAGG TTGTAGTGAG CCGAGTTTGT 
15701 GCCACTGCAC TTCAGCCTGG GTGACAGAGG GAGACACTGT CTCAAAAAAA 
15751 AAAAAAAAAA ACCAAAACCA ATATAATAAA TAAGTGGCCA GCAATGAAAC 
15801 AGAAAGTGAA AAGTTAGTGA AGCAAAACTA GTACTGTATT CAGATAAAGA 
15851 TGCTGAATCT AGATTTGGTC ACCAGAATAG GGTCCTTTGT GGCAACCTGG 
15901 GCTAGTTTGG CTGACTCACC ACTGCCAGGA TGAAATTTCT TTCAGTGGCT 
15951 ACTCATTTCC CTTTATTTTA AGTCCATGCT CACAGAGCAA CCTTCTGATG 
16001 CCTAATTCAG CTTCCTGGGA TACTTAATM CAGGAAGGGT CTGGAAGTAG 
16051 TACCTGTATA GGGGATATGA GTGTTCTGAT TTTAATAGTC AATTCATAAG 
16101 TGTACAGAGG GnTGATAAA TGGTTAGGTC AGAACCATCA CAGAATGTCT 
16151 ACACCTCTTT GGACATTAGG AAGGTCAAAA ACCTGAAAGG CCAAAAGCTA 
16201 GGCCTAGATT AGGGTCATTC ACCAAGAAAA CATCAGCCTT GAAGAGTTCT 
16251 CTGGGTGGTC CACCAGTCAA CCTTCCTTTG ATCACACCTC CTTCCTCGTT 
16301 GCTTCTTTAA GCATTGACCT GTAATGGGTA TGGAATTTTT TGCTCACCTA 
16351 ACTCCTTCCT TTTACAGAGG AAGAAGTTGA AGCCCAGAGA GATTTAATGG 
16401 CTTGCCTAAG ATCACACGCA GATTTTCTGT TAACCAGGGT GATTTTTCAG 
16451 GTGTTCCCTG CCAGACGAGG GCTTTTTTCC TTGAATTGCC TAGAGATTTC 
16501 TTGAGATATC CGAAGCATTT TTCCCAGTGC AGCCTGGAGA AGGATGTCCC 
16551 TGTCAACACA GCATTTGTTA CTCAATGTTA GACATTCAAT TTTCTAATTA 
16601 GTATCATGGA GCAACAGTGG ATGATTATCT ATAAGGGGTT GCAATTCCAT 
16651 GCTTATGTGC TTACAGCCCA TATAGACAAA TATCAGCTGT TAAAATGACA 
16701 AGGCAGTAGA GATGTGGCCC CAGGACAAAG GCATACTCTG CTGTTAGTGA 
16751 ACACTAGTTG GCCAGCAAAT TTCACATGGG CATATACACG GCCAACTGTA 
16801 GACTTTAGGC ATTTATACCC ATTCAGAGAG CCAAACTGGC AACTAAAGAT 
16851 CAGCATTCTC TTTGGCATTT CAGCTTTGCG TTCTGTTAAA AATCACTGCT 
16901 TGCTTAAATA CCTCTGATAG CTCTTCACTG CCTGTAGGCA ACTCTTTAGC 
16951 CTAGCAGACT TGGTCTTTAG TGCTCTGCCC CTACTCTCTT CCACCATTCT 
17001 GGCCTCCTGT CTAATTGCTG CCCATATGTG CCATGCACTA GAGCTTACAG 
17051 ACCTGCTCAG CGTTATATGA GCATACCATA CTCTTTATGC CTCAGTGCAT 
17101 TTGCACATGT TGTTCCTTCA GGCCAGAATG CCTGTTACTG CCTGGCAATC 
17151 AGCCTATTAG AGTCTGCCAA TACCATCCCA TCTTCTGTGG AGGAGCCCCC 
17201 CGCCAAATCC ACCCATACCT CTCCCCACCA ATCAGAGACT TCTTCTCTCT 
17251 TTGTTATTCT CTTCGTTATT CTCTTCATAC CTCAGTTATA TCCATTTCAG 
17301 TATTTGTTTA CACATCTAGC ATCACTCTTA GAGTGTGAAA TTCTCCAAGT 
17351 GTGGAGCCGT ATCTAGTTTG TCTTTGTATC CCAGAGCTTA GCAAAGTGCC 
17401 TAGAATGTAG TGGGTGCTCA GAGTGTTTGC TGGGTGAATG ATGTATTTGT 
17451 TGAACGACTC TTTGGACACT TGAATAAAGT CCATCCAGTA TGCACCATTA 
17501 CCATCTCTTC GCTCTACAAT ATTCTTTTAG GCAAGAGCTT ATCTTTTGAG 
17551 GTGATAAGAT AAGCTCAAAC TTATGTAGAC TAAGACCTCA GTCTGTAAAT 
17601 GTCATCCCTA AGTCTTAAAC CATCAAAACC AGGGCCTCAA GGAATGGCAT 
17651 GCCTTCTGCA ACTGTAGCAA CCTGCTGTGC TTATTTTGCC GTGTTTTTCA 
17701 TTTTTCCCCC AAAAGCTAGA GTCCCTTCTC CCATGGGCAG TGCTGGAAGT 
17751 GTGCTAACAA ATTCTTTCTG CATACTGCTT ACGATTACAA AAAAAACCCT 
17801 CAGCATCTCA TGCCAGACTT GAGTTAAGGT TGTTTTCTTT TGTGTGTCAG 
17851 CTGTATTCTG GTCATGACTT CCTGATGATG CCCTATAGAG ATTTTGCTGA 
17901 GATCAGAGGG TGCTCCACTG CCATCAGTAG CACTGACTCT TGCAGAAGCA 
17951 CCGTTTCTGA AGTTGGCTAA TGTCATCCCT CACGTTTGTT TGTTTGAAAT 
18001 TTGTTTTAGT TCCAGAGATA GCACTTTCAT GGAATGACGC TATCTTCTAG 
18051 AATCACTTTT TTTTTTTTTT TGAGTTGGAG TCTCGCTGTG TCGCCAGGCT 
18101 GGAGTGCAGT GGCACAATCT CAGCTCACTG CAATCTCCAC CTTCCGGGTT 
18151 CAAGTGATTC CCCTGCCTCA GCCTCCCGAG GAGCTGTTAC TACAGGCGCA 
18201 CACCCCCACT CCTGGCTAAT TTTATGTGTT TTAGTAGAGA CGGGGTTTCA 
18251 CCGTGTTGGC CAGGATGGTC TCGATCTCCT GACTTTGTGA TCTGCCTGCT 
18301 TCAGCCTCCC AAAGTGCTGG GATTACAGGT GTGAGTCACC GCGCCTGGCC 
18351 TAGAATCACC TTTTTATACC ATAACGTGAG CACCACTGCC GCGTCACCAA 
18401 GGAAAGAGAG AGGCAGCTAC TGTGGGGTTA CAAATGGGTA AGAGTGGCAC 
18451 CAGGAAGGTG AAAGTCTCTA CTTAGCCAAG GCTTAACAAA ATGTCAATCA 
18501 CCAAACATTT ATTTATTAAG CTACGTTCAG GATAAGAAGA TGAACAAGCT 
18551 ATCTGTACAT TCATTTTCTC GTTTGTAACA AGGTAATGAT AGTGATCTAT 
18601 CCTGCCTGCC TCTGAGGGTT ATTGTGAGAA TAAAATGAAA TCAAGTGGAA 
18651 AAGCACTTAG GAAAAAGAAA AGCATTGGTT TTCAATTGTT AGTGTGGATC 
18701 AGAAACACTG GGGCTTGTTT AAAATGCAGA TTCTTAGCCC CAGTCTCAGC 
18751 GATTCTGATT CTGTATATCT GAAGTGGGAC TCAGGAATCT TGATTTTCAA 
18801 CAAGCTGACC AGAGGGTCCA ATGCTGCTAT TCCTTTAGTT ACACTTTCAG 
18851 AAATATTACT GTAAATCAAA TGGCAAGAAT AAAATAGTTA TTTGAGGCAG 
18901 TTTTAGTATG TTGGACCTGG AGTCCAAAGA CTTGGGTCAA ACTCCAGCTT 
18951 TGTCAGTTCC TAGACCTGTG ACCTTAAACA GCAACCTTCT CTGTGAACCT 
19001 TAGTTCCCTC AGGAACGGCT CTGGTCACCT CCTGCTGTAC TCCATTGATG 
19051 ACTCACCACA TAAGGCTCCC TGGGAGTCCC CCAAACCTTT GCTCTCTTAA 
19101 CTCCTTTTAC AGCCTCCTAC ATCTCCTGCA GGTGCTGTCT TCTCCTCCTT 
19151 TTTCCAGGCC CTGCTCTGAC ACAGCATTCA TTCTCCTCTG GGAAGGGTTC 
19201 CTTCAATGTG TCTCCAAGCA CATCACACCC AGGAAGGACC CTGTGGCCAT 
19251 ATCTGTCTAT CACCAGATCA AACTACGTGA AGGCAGGCAC TAGGTACTGT 
19301 CAGTGCCCAG CATAGGCCTG GCCCATACCA GGTGTCCACA GATGCCTAGT 
19351 AAAGAAACCT ATGATTCAGG ACCCCCATGA TGAGCAACTA TAGCACTAGA 
19401 ACAGTGATAA TAACTAATGT TTATAATGCA TCTTCAGTTT ACAGAGGGCT 
19451 TTTGTACTCA TCATCTAGTT TAGTTCCTGC AACAACCTCT TGAGGAATAT 
19501 AGCACAAGCA GGACAAGGGA AGCCCAGAGA TGTTAAATAA TTTATCCAAG 
19551 TTTATGCTGC TGGGAAGGGC AGCACTGAAA TTAAAAGAAA AGTTTTCTGA 
19601 GCTCAAATCC CATGCCCTTT CCTCAATGTG AGCTCTAGCA AGGTATTCAG 
19651 GAATCCTGCC TCTACAGTTC AGAGCCTCAA ATTGCTGGGT ATGTTGAGTT 
19701 CTTGTATCTG ATTTTTCTAG ATTTCCTGCC CACATTCTTA CTGTCTGGAT 
19751 ATCAGGAAAG AGTTTATCAA ATGCCTGTGG AAATCCAAGA TAAGGTCTCA 
19801 TGATGAGTAA CCCAGTGAAA ACATGAAGTC AAGTCTAACT AGTCACTACT 
19851 ATTTCACTAC TGCTGACTCC TGATGATCAG CTCCTTTTCT AAGTGCTTAC 
19901 TGTCCACTTA TTCCATCATC TGCCTAGAAT TTATGTGAAG GAATCAAAGC 
19951 AAAAGGATCA TAAGGCTTCC TTTTTCCAGT ATGTTTTTCC TCCTTTTTGA 
20001 AAACTGGGCC AGTTAGCTAT CTCCATTTTT ATTTCATGAA TACATCCCCA 
20051 GCGCCTGGTA TATAGTAGAT ATGGAACATT ACACTTTGGA GATATTGCAC 
20101 CCAUCTCCA GTTTCTCCAA AGTTACTAAC AATGGTTCCA TCACTGTGCC 
20151 AACATATTTT C M I I NC AA TATATTGGGA AATAATTCTC CCAGTCTGAA 
20201 AATCTGAACA CATTTCATGT GACTTGGTAT CCTCATATGT CTTGGGCTTC 
20251 GAATTCTCCA TTCCTAGTTT CAAGTTCATG AACTGTAAAA CAAAGGATTA 
20301 GACTAAATCT CTAAAGTTCT ATCCAGATGC CAAATTCTTT TCTCTTTCCA 
20351 TGATACCTAA GATAGATGCC AAATATTGTC TTTTACCTGG TGTTTGTGAA 
20401 CATGACATCA CATTACAGGA GTAGCAGATA CTAAACTCTC ACTCTGTAAA 
20451 ACACTGACTG AGTTCCATGA GCCAGATACT GAAGTGAGCT TGTTCACATA 
20501 TGTTCTCATT TAATGCTCAT AACCCTGTGA AGCTGGGAAT TGCTGGGACA 
20551 TTTTATTTAT TTATTTATTG AGACGGAGTC TGGCTCTGTC ACCTAGGCTG 
20601 GTGTGCAATG GCATGATCTT GGCTCACCGC AACCTCCGCC TCCCGGGTTC 
20651 AAGCGATTCT CTTGCCTCAG CCTCCGCAGT AGCTGGGATT ACGGGGCACA 
20701 CACCACCACA TCCAGCTAAT TTTGTATTTT TAGCAGAGAT GGAGTTTCTC 
20751 CATGTTGGCC AGGTTGGTCA CGAACACTTG ACCTCAAGTG ATCTGCCTGC 
20801 CTCAGCCTCC CAAAGTGCTG GGATTACAGG CATGAGCCAC CATGCCTGCC 
20851 CGGGACCCTT GTTTTAGAAG GATGACTGCT GCTATAATGT AGAAAGTGAT 
20901 TTGGAAGAGG GGAGGAGTGG GGCACGAAAG ATGGTTAGTA GATGGGGGTG 
20951 GTAATGCTTA CCTTTCAGTA TTTGGAGGCT TCGGAGTCCT CAAAAATTCT 
21001 CTTCCTTGAT TGGAGTCCTC CCAGCCAATA GAGGGCTTCA CACAAACAGT 
21051 TTCTTGGGTT TTGAATTGTT TGACCAGAGC TTTCTTCCGA CAAAAGGTTG 
21101 GGGTGATTCA TTCACTTACC ACACCTTGCC TGAACATTCA CTTGGGGCTG 
21151 CCGGTTATGA AGGCTATTGT TCTCCAGCCT GTCACAGACG CTTTGAAGAC 
21201 CTGTGCCTCA GCTGGTTCTA AGGAGTCAGT TTGTTCAGCT CCGTGCCAGG 
21251 TTTCCAACTT ATGAAATGTG CTGGAGATTA ACACCTCTCC TGCCATTTTA 
21301 TCCCTACTAT AATTGCCAGT CAAAGGATTC CTGCAGTTGC CTCTGGCAGC 
21351 CATAACTGAT GAATGTTCTG CCAGCTGCTC TGAGGACCTA GAAGAGCAGT 
21401 TTTCTATCCA GGACCAGTTT CCAAGGGTGG GAGGGTGAAA TATATCCTCC 
21451 AGTGTGACAT TTCATCTCCC AGTGATGGGT GGCTTGGGCC CTTTGAAGTT 
21501 GGCTCTGAGG AACCACACAC TTGGGTCTGA GCAGCCAGCA GCTTATCACA 
21551 TCTGGTGATC AATCCTTCAA AGGTTCCTCC TGAAGTCTGA ATTTTTGGAG 
21601 GTCAAATGGA TTCCACCTGG GAGGGGCTTC TGCTTCAACT CAGGACATGG 
21651 GGAGAAGGCT GTTCCTCTTC CAGGGGGAGG CAGTTTTCAT GGCATTGAGA 
21701 TGTCCTCTCA CTTATTCCCC ACCCACCCAC CAAGTCCTTT GTAAGAGGAG 
21751 TAGGGGGAGA GGAGAGCGCC TGCAGCCTCC TGCTCACATT CCTAGACACC 
21801 GACTCACTGA GCCCGTCGCC GCTGGAACAG CAGAGCTGTG TGAAATGTCA 
21851 AGAGGAGTTA TGCTCATAGG CTCCCTGGCC TCAGTCTCTT TGTGGCTTGC 
21901 ATATTCTTCC ATTAGTACTG TGTTCATCAC ATGGAAATCA GAGGGTACAA 
21951 TTAAAAGATA ATTTGCTAGT CCCAGACTTA ATTTGGGGCC CCCTTCTTGC 
22001 CTGATTGAAT TACAGGGGAA CATAATAGAT TTTTGGTGAG AAATAGTTGT 
22051 CTGTGTGGCT GGGAGAAAGA TTGCTCCCAG CTCTCCAGCT GGGCAGCCCT 
22101 TTCAGTATCC CGTATGTTAT TTCCCCACTT CCAGCCCACC TCACCTCCTC 
22151 TGTGGCCCTT GTGTGTCCCC TCGGCTAGGA TCCTGACCTC CTGCTCAAGA 
22201 GTTTAAACTC AACTTGAGAC CCAAGGAAAA TAGAGAGCCC TCTGCAACCT 
22251 CATAGGGGTG AAAAATGTTG ATGCTGGGAG CTATTTAGAG ACCTAACCAA 
22301 GGCCCAGACA GAGAGAGTGA CTTGCTAAAG GCCACATAGC TAGCCCACAG 
22351 TAGTTGTAAC AATAGTCTTA ATGATATTAA TGGCTAACAT TTATCAACCT 
22401 TTAATGTGTC CCAGACTTTG TGCCAAGGGC TTACATGCAG TGCATTGTCG 
22451 CATTCAAACC CAGACAGTCT GGCTCTGGGC CCAGGCTGAG CTTTGGTATA 

FIG. 3-9 

U.S. Patent Jan. 22, 2002 Sheet 15 of 41 US 6,340,583 B1 
22501 GCATGGTAGA ACGTTGTCTA TAATGTCTAG TCTGGGTTCA AATCCTGGCT 
22551 TCACTTCTCA CATTTACAGC TGAGTGACCT CAGGCAAGTG ATTTAACCTC 
22601 CCTGTACCTC AGTTGCTTTA TCTGTAAAGA GAAAAATCAC AGCACTGTGG 
22651 AATAGTGGGG GTTAAAATTC ATTCATACAA GTAGTGCTGC AAGCAATGTT 
22701 TAATACAGGG TGAGCACCTG TTCAGTGCTT CCTTCTTCTG GCTGCCTCTG 
22751 GGGCTAGAGT GTGGTGTCTT CGTGGTATAG ATAGATAGAT ATGGCTGAGC 
22801 TCTGCACAAA CACCAAGAGC TGTTCTTCAC TATTAGAGGT AGTAAACAGA 
22851 GTGGTTGAGC TCTGTGGTTC TAGAACAGAG GCCGGCAAGC TATGGCCCAT 
22901 TGCCTATTTT AATACGGCCT GTGATTGATT GATTTTTTTT TTCTTTTTGA 
22951 GACAGAGTTT CACTCTTGTT GCCCAGGCTG GAATGCAATG GCACGAACTC 
23001 AGCTCACCGC AACCTCTGCC TCCTGGGTTC AAGCGATTCT CCTGTCTCAG 
23051 CCTCTCGAGT AGCTGGGATT ACAGGCATGT GCCACCACGC CTGGCTAATT 
23101 TTTGTATTTT TAGTAGAGAC AGGGTTTCTC CATGTTGGTC AGGCTAGTCT 
23151 CGAACTTCCA ACCTCAGGTG ATCTGCCCGC CTCAGCCTTC CAAAGTGCTG 
23201 GGATTACAGG CGTGAGCCAC CATGACTGGC CTGATTGACT GATTTTTTTA 
23251 GTAGAGATAG GGTCTTGGTT TGTTACCCAG GCTGGTCTCA AACTTCTGGC 
23301 TTCAAGCAGT CCTCCCTCCT TGGCCTCTCG AATGCTGGGA TTATAGGCAT 
23351 GAGCCACTAT GCCTGGCCTA TATGACCTGT GATTTTTAAT GGTTAGGGGA 
23401 AAAAAAGCAA AAGAATGCTT TGTGACATGT GGAAATTACA TGAAACTCAA 
23451 ATATCAGTGT CCCAGCCTGG GCAACAAAGT GAGACCCTGT CTCTACAAAA 
23501 AATAAAAAAA AATAAGCCAG GGCCGGGCGC AGTGGCTCAC ACCTATAATC 
23551 TCAGCACTTT GGGAGGCCGA GGCAAGTGGA TCACCTGAGG TCAGGAGTTC 
23601 AAGACCAGCC TGACCAATAT GGTGAAACCC TGTCTGTACT AAAAACACAA 
23651 AAATTAGCCG AGCATGGTGG CATGCGCCTG TAGTCCCAGC TACTTGGGAG 
23701 GCTGAGACAA GAGAATTGCT TGAACCTGGG AGGCGGAGGT TGCAGTGAGC 
23751 CAAGATCGCG ACACTACACT GCAGCCTGGG CAACAGAGCG AGACTCCGAC 
23801 ACACGCACGC ACGCACACAC ACACACACAC ACACACACAC ACGCTGGGTA 
23851 TGGTGGCCAG CACGTGTGGT CCCAGGATGC ACTGGAGGCT TAGGTAGGAG 
23901 GATCACTTGA GCTTAGGTGG TTGAGACTAC AATGAACCAT GTTTATACCA 
23951 CTGCACTTTA GCCAGGGCAA CAGTGTGAGA CTGAATCTCA AAAGAAAAAA 
24001 AAAAAAAAGA AAAAAATCTT TCCATAAGTA AATATCTGTT GGAACATAGC 
24051 CATGTCCCTT AGTTTATGTT TTATATATGG CTGCTTTTGC CCTATAATGA 
24101 CACAATTGAG TGGCCACGAC AGTCTGTATG GCCTGCAGAG CCTAAGATAT 
24151 TTGCTCTCTG GCCCTTTACA GAAAAAGTGC CTTGACCTGT GCTCTAGAGC 
24201 CATATGTACC AGGTTTGAAA CTCAGCCTCA CAGCTGGGTG TGATGGCACG 
24251 CATCTGTAGT CCCAGCTACT CTGGAGGCTG AGGTGAGAGG ATCACTTGAG 
24301 TCCAGAAGGT CGAGGTCAAG ATTGTAGTGA GCCATGATGG CATCACCGCA 
24351 CTCCAGCCTG AGTGACAGAG AGAGACCCTG ACTCAAAAAA AAAAAAACAA 
24401 AAAAAAAAAA CACCCTCACC ACTTATCAGC TATTTGTCTT GAGAATAGTG 
24451 ACATAACCCC TCAGAACCTA TTTCCTAATC TGTTAAATGA GGCTGATGAC 
24501 GTTTCCTCCT TTTACTGGCA ATTTAAACAT GATGGATAAT AAATGCTAAG 
24551 CACTTAACAC AGGGCCTAGA AGATATTAAC TGCTCAATAA ATGGTAGCTT 
24601 CTTAACAGTA TTCAAACCCA TGTGCTCTTA TCACATGCAT TGTTGTCCCT 
24651 GTGTCCAGTT GGTGGAATGG GAAAAGGCTC CCTTGTAACC CCATCTACCA 
24701 TCTTTATCAG ACTTTCCTGC CATGGTTCAC AGTAAGAGAT AGAAGCTGCA 
24751 CGGTGACTTC TGGCTCTTTA CAATGGTGAG CGGTGTGTGC CTGGTAAGGG 
24801 AGAGCTGATG TCACTGCCCC AAATCCAGTA GTGAGATCTG AGTGTTCTGG 
24851 TTTCCTCCAG CAGCCTTGCT TTTTCCTTTA CAATCCTGCA GGCAGGGAGA 
24901 CAAGGGCTTT CTACATGGTA GGCTCTGGTT TGGTCATCGT CACAACTGGG 
24951 GGCTGTTCAG GTGGGCTCCC ATTCCAGATA CCTAGGCTTA TCAATCCCTT 
25001 TTGGCACCCC AGGCCTTTTT CTCCCTCATG CCCCATTTTT CAGTTTGAAA 
25051 AGCATGGTTA TCACAGGACA AGTAGAAGAA GCTCCACTGT CCACTGAGGC 
25101 CAATGGATGG TGTTCTGCAT GTGAACACTC AGTGAATAGT GAGTGAATGA 
25151 GAGTAACCTG GGCTCCATCC TATTTGCAGA GAGCTTTGGA AAAGATTTTT 
25201 CTCCTTAAAG AGCCAGAATG AAGCCTGGTA GTGGGAGAGC TCCAGCTCTA 
25251 GAGTCACATG AGCCTACATT TAAATTCCAG CCCTGCCACT GACTCCCTTT 
25301 TTGACCTTGA GTGAGTTACC TAATCTCTCT GTACCTCACT TTTCTTGTCT 
25351 GTAGAGTGGG AATAATTCCT GTCTCAGAGA AATAAAAGAG TGCATATAGT 
25401 GTTTGCCACA TGGAGACACA TCAGGTGTAG GTTAATACTC TGGGCCTTGT 
25451 TTCCTTATTT GCAACACAGC CCTGCCCTGG AGTGGAAGTG GCACCTCCCA 
25501 TTGGTCAGCT CTTGAGGCTG TCCCCAGGAC AGGCAGAGGG AGGGAATGAA 
25551 TGGGAGCCCT AGTGCCAGGA CAGAACAGAT GGCAGCTCAG AGCTAGGATG 
25601 GCTCTCTGGA CCTGTCTCTC CTACCAGAGG TCCCCCCGTC TGGTGTGGCT 
25651 CTTCCTGGAC CTGGCATCCT CTGCTTTTTT TTTTTTTCCA CCTCCAAGCA 
25701 GAATTACTGT CCTGTAGGCA GCTCCTCTGC TTGAGGACAT CTGGGGCCAG 
25751 ATATGTTCAC ACTCTATCCT GCCTTGCCCT TCCCTGAGCT CAGGATGGAC 
25801 GCTCAATTGG TCCCAGTTAT TGTCTGCAGC GCCTGCCTGC AGCCTCGATC 
25851 CAGCCCAGCT CCACCCCTTG CCTGCAAGGT CTGTTTCCTA ACAGCTGCTC 
25901 CAACCACACA CCTCGGTTCT GCGGGAGCCC CTCCTCTTCC TCCCTCCCTC 
25951 CCTCATTCAG GGGTGGGACT GAAGAAGAAG GCTAACTTGA CAGCAGCGCT 
26001 TCTTTCTTAG CTAGTCACCG GCCCCTGCTC AAGAATGCCA GTGTGTGTGT 
26051 AGCCTCCACA GAGAGGTCGT TTTCTCGGAG TCCAGAGGGG CCGCCTGAGC 
26101 TTCTGAGAAC TAGGGAGGAG CCATCCCAGC CATGAGCCCC TGTGGGAATC 
26151 TGCTGGGGGC CAAGTGGCCT GGAGTCCTCA GGCTCCCGCA GCTGCTCCGG 
26201 AGGGAGAGGT GAGCTCAGGG CAGCCTGCCT GCAGCCAGAG GTGCCGGGAG 
26251 CCCCGGGCCT GTCATGGTGG CCATCTACAG CCGGCCTGAG GCAGTCACAG 
26301 ACGGATTTGC AGCTGAGCCT GTCTATCTGG TGTGGGAAGA AGATGGGGAG 
26351 TTACTTGTCA GTCCCGGCTT ACTTCACCTC CAGAGACCTG TTTCGGTGAG 
26401 TTGGTCTCCG AGTTCCCCTC TCCATCTCTC CTGGCCCCTG GTCCTGAGAG 
26451 GAGGGTGGTC TCCCTAAATC TCCTTCTCAC TTAGTCCTTT ACCATCGGTT 
26501 CTGCCGGGCA GAAGCCAGCG GAGGTTATAC CCAAGGAGAA TCGGCCTTGT 
26551 GAGGTACCCC CATTATGTCC TGGAAGTGGT GAGGGGAGGG ATATACCCAG 
26601 AAGGAACTTC TTAGGGAGCT CCAGCTCCCC TTCTATCCCA GACAAACCTG 
26651 AAGGAGCCTC CAAAAGATGC CACTGACCTG CCCATTGTAG ATGTTACTGC 
26701 TTCCGGGGGG AATAGCCCAA ATAGAGTGCT GTTTCCAGCT CTCACATGTC 
26751 TTACCTGCGG GCCATGCTGC CTGCCCAGGA ATTTGTCCCA ACAAGCAGGA 
26801 TGGGCAGGTT TTGCCAAACT GTGGAAACTG GCAAGTCCTG GGTGTGGGTA 
26851 GCCTGGTACA CAGTAGGCAC CTTATAAACG TTTGTTCTCT TAATGGCAGG 
26901 CACATTTGCC TCTGGCCTTG AAGGGCTTCT GAGCTCCCAG GTGAATGTAG 
26951 TTGCTGGGGA AAGACCTGGG CGAGTGCTTC TAAGACTGGA GCAATGGGCT 
27001 TTAGAGTGTT CCTGAGCTGC TGGGCCAGCC CCCACACCTC CTCAGTCCCT 
27051 AGGCCTAAGT ACCTCCACGA GCCTCTCTCT GTGGGGCTTC TCAGAGGGAG 
27101 ATGTGGAAAC TCTACCTCTA ACCTGGCTTT CTTTGCTCAT TGCCCCACTC 
27151 CACCTCCCAT AGAAACTCCC CAGGGGGTTT CTGGCCCTCT GGGTCCCTTC 
27201 TGAATGGAGC CATTCCAGGC TAGGGTGGGG TTTGTTTTCA TTCTTTGGGA 
27251 GCAGCCTGTT GTTCCAAAAA GGCTGCCTCC CCCTCACCAG TGGTCCTGGT 
27301 CGACTTTTCC CTTCTGGCTT CTCTAAGCTA GGTCCAGTGC CCAGATCTTG 
27351 CTGCCGGGAT ACTAGTCAGG TGGCCAGGCC CTGGGCAGAA AAGCAGTGTA 
27401 CCATGTGGTT TTGTGGAATG ACCGGACCCT GGTAGATTGC TGGGAAGTGT 
27451 CTGGACAGGG GGAAGGGGGA AGGGAACTGG TCCTCAATGC TGACTCTACC 
27501 AAGCGCCCTG CTAGACACTT TATCCTTTAA TCTCTCAACA GCCTAAAGAG 
27551 ATTATATATC CCCATTTTAC AGATGAGGCA ACCAGTTTCA ACAGAGTTAA 
27601 CATATGGAGC CTCACTGGGC AGCTTTTTCT GTCTTCCTGA CTTTCTCTCA 
27651 TCCTTCAGGG GGCTGCAGGT TTGTTTTCTT CTCCTAGTGG AGAGGAAATT 
27701 CTCAGGTTTG TTTTCCTCTC CTAGCAGAGA GTAAAAAAAG GGATAGTTTG 
27751 CCTGACTTGT TGAAGGTGTG GCTGAGATTG TTTTCTAAAG AGCCAATGGA 
27801 AATTGATCTT GAGTTTAGGA GAAAGCTTTT ACATGTGGAA TTAAGATGCC 
27851 AAGTGTTGAA GTAGCCACAT TTCAGGTCCT CATTAATTTC TCTTAATCCT 
27901 GGGAAGGCAG CTTAGGAGAA GGGTTGTTCC TTTAGGAGCC AGGAACTATA 
27951 CCCCTTTTAC CCTTGGAGAG GCAGGGAAGC CAGGGAGGAC ACAACTTCTC 
28001 AGGAAGAGGA GAAGCTAGAG CAGATAGTGA ACTCTCAACC TGAACCTTTA 
28051 AGGGCCAGAC CACTAATGCC ACCCAAGTCC ACCTGCCGTT TGTCTTGTTC 
28101 TGTCCCAGGC TTTCTGGAGA ACCTGATCTT CTTGCCCCTA CCCCCAAGCT 
28151 CCGTTTGCCC AGCTAGAGTC TGGGGGGTAC TGACTGACTT TCGTAGACAT 
28201 TCTTCCCTTC CCCAAATAAG AGGCCACATT CCTGAAGTCA CTTCTGAAGA 
28251 GATAGCTGCC ACACAGGGCT CTTTCCCCCC AGGGAGGGAC CACCCAGACC 
28301 CTCTGCTCTC CCAGGTATCC GTTACCACAT CACTACCTGG TCAGAAAGCT 
28351 GTTTCTGCCA TTAGCCCCTC CCTCTTTTAT TATAGGATAT CCTCAAGGGC 
28401 TCCTCTTTGG GCCTCAGTTT CATCCTTGGC AGAAAGTAGA AGCTAGACTT 
28451 CTTGGGCTCC TGAACAGGGT CCTTGCTGGA TTCTGTGAAA CAAATTAAGT 
28501 TCTTGACCCT AGGCCTCTGG GGGAGTACAA AGTCTATGGG AGTTCTGGGG 
28551 CTGTGGTTGC AAGGAAAGTG ACGCAACCAG ATTCCATGGG GACATGATCA 
28601 GGCGTGACAT GTGAGGGAGG AAGAGGGAGC AAGGGAATGA AGAATACAAC 
28651 TTCTGTGTCC CATACACCCC TGCCTGACAG GCCATACATA CTCAGCAGAG 
28701 AATGCACTGT CTTTCCTACC ACACTAGCGT GAGGAGTGAG CTGCAATTAC 
28751 CACTGTGCTT CCAAGTAAGA AAATACCTCA AATTGGAATT TACAAAAGAG 
28801 GTAAATTAGG GAGTGGCTTT TGTCGGACAT CTTTAAAGCA TTTTTCTTTT 
28851 TATAGAATTT CACTTAATGT CCAATACTGA TTTAATGAGC TTGGGTTTAC 
28901 ACATTATCTC TTGAAGAAAA CAAATGAACC TTTGTGTTCC AAAGCAATCC 
28951 ATGTTTAAAG GGAAAAAATT ATGCATAACT CTGCCCAGCT TCACAGTAAC 
29001 CTTTGGCAGG TGCCTTAGGT CCTCTGGGAC TCTTTTCCTT ATCTGAAAAA 
29051 TGAAGGACTT GGATCAGGTG AATGGTTCCC AGCTCTGCAA CTTATGTGGC 
29101 TCCTCAGAGG CACACAAGCT CTTTTCCATT ATTTGCCAAA TAATGGAGGC 
29151 CCTGTCTTTA ACTGCAGTAC AACTACACAA AATACTTGAA ACTACAGTCT 
29201 TCCTGGTTTT TGGTTGGAAC TGAATCAGTG CACTCTAGCA ACACTTATTT 
29251 CTTGCTGTTC GTAGGCTTCA TTATGTGTTT GGTTAATnT TTAAAACAAC 
29301 AATAACATAT TCCATAATAA TTACAGCTTA ATTGGCAGAC TGTTTCAGTC 
29351 TATAGGATCT GCAGGAAGGA GGAGTAATAA AGGGATTTTT GACTGAGCTC 
29401 TTATGGAACA GAGTCTCTCT AGGCCCCTGT CATATCTGCC CTTCTGGGCC 
29451 CTGGGGAAAA GTTGGCATCC CCAGTTGTGG TGCTCTCCAG GTGCCCTCAG 
29501 GCTGTGGTGG AGGGAGCTTC CCATTCTCTC CTTCAGCCCA CTCAATTCAG 
29551 AGGCTAGGGG CTGAAAGAAG CTTCTCTACA ACTGGCTGTT CACTGGGAGG 
29601 TTAAGGGATG ACCATCCAGC CAGGCCTTCC TCAGGACATG GGAGGGCTTA 
29651 TGCTTTAACA TGTGTAAATC CACTGCAATA ATGACTGGTT CTTTTACCCC 
29701 ATAAGGTTGA GAATTTACCT GTAAACATTT TTGTCTGAAG AATTTGGATG 
29751 TAAGTGAGGG CTGGGCCTCT ATCTTATCTC ACTTGGCTTC TCTCAGCACA 
29801 GCACCTTGCC TGCTTGTTCT TACACATCCT AGATGCACAG TAACTATTTC 
29851 CTAATTATTA GAAATCTATT AGAATCAATT GATTTCAGCT GGGCTTGGTG 
29901 GCTCCTTCCT GTAATCCCAG CACTTTGGGA GGCTAAGGCT GGAGGATCAC 
29951 CTGAGTCCAG GAGTTTAAGA CCAGCCTGGG CAACATAGGG AGACCCTGTC 
30001 TCTACAAAAA ATAAAAAATT AGCCAGGCAT GGTGGTGTGC ACCTGTAGTC 
30051 CCAGCTACTC AGGAGGCTGA GGCAGGAGGA TCTCTTGAGC CTGGGAGGTC 
30101 AGACTACAGT GAGCAATGAT TGTGCCACTG CACTCCAGCC TGGGTGACAG 
30151 AGTAAGACTC TGTCTCTTAA AAAAAAAAAA AAAAAAGTTG ATTTCTATTT 
30201 GGATAGATAA ATAATTCATT TTAGGACCTT TCTTTTTCAC TTACAGAAAT 
30251 CTGTTTCATT CTGGGCTGAG AAGCAGGTCC ATATTGCTAG GCATAGGAGA 
30301 AAAAGGGGTC TGTCTGCATT TGCCCTTGGT GGTCTCAAAT TGGGGAGGGA 
30351 AAGAAATGAA CACTTACTGG CTACCTTCTG TGAGCCAGGC ATCATGCAAG 
30401 ACATCTGTAC ATAATTTAAT TCTCATAACC CCATAAGATA TTATTAGCAA 
30451 TGTACAAGTG AGGAAACTGA GGCTCAGAGT CATGAAGTAA CTGGCCTTGG 
30501 GTGACACAGA TGGTAAATGG CAGAGAAGGA ATATGGATCC AGGTCTTGAA 
30551 AGAGAAAATC TCAACTGATT ATCTTTTTTA AAAAACTCAT ATGTTCTCTG 
30601 CTGACTCAAA AGGTCTCTGT GTGGATCTGG GTTGACCCAC TGAACTGACC 
30651 ATCAGGGTTC CATGCACTTT GTATCTGCCC AAGCCCTCAG AACCCCTCAG 
30701 TAATGTTTTG GAAGATGAGT TTTGGAGGTT GTCCTTAGGC ATAGCCTCAG 
30751 CGTATGTAGG CCTCTAGGTG ATCTCCCCTA ACCTGAGGAT TTCAGCTCAA 
30801 TTCACTCTGG CTCCTCAGGA CAGTGGGATG ACTGGTTCAG ACCTCAGCTT 
30851 TACCACCTCC CAGCTGGGTA CTCTTCTACC TACAGCCAGG GCAGATTTTG 
30901 ACTTTCACTT GAAACTTCCA AAAATTGAAA GGTAGAAAAA CAGCCTTGGC 
30951 TTTGGGAAGA ACGTATGATG TCCATGGCCT CTAAGCATCT GAGGTGGGAC 
31001 ATGTTCGAGT AGCACCTTAC AGTTCCAAAG TGTGTTCTGG GTTCTTTGTT 
31051 TAAAAGAACA GAGACTGCTG GGGAATTGAA CACTGTGAAG TATATGAAGG 
31101 AGGAGAATTG TGCTATTTAA CATTCAGTAC TTGGGCTAAA GGAGAAGCAT 
31151 CACGAAGTGT TAACACTCAA AGGGTCTTGA GCTGTCAGGG CTCCAGCTTC 
31201 CTTATTTTCA CAGGTGAGAA TCCTGAGGCT CAGCTGTTGA GATGTGCTGT 
31251 CTCACTCCGG TGACATAGTA CAGTGGATGT GGCTTTGCAG CCAAGCACAC 
31301 ATAGCTTCAC ATTCCAGCTC CATCAATTAT GTATTGGGCA GCTTTGCAGA 
31351 ATGATTTGAC TTTAACTCTG CTTTTCAGTC TTCTGTAAAA CAGGGATAAT 
31401 CCTGCTACCG TAGGGTTGTC AGGATTAGAG ATAATATAAA TAAGGTACCT 
31451 CATATAGGAC CTGGATTATG GCTGGCATTC AATAAATAGT AGCTGTTAAT 
31501 TGATAGCTAA GCTAGAACTC TGAAGTCTAC CATGGCAACT TCTTAAGTGG 
31551 TCTGAGAACC CAGTTGTGTT CTGTGGCAAA ACACAGCTTA GGGATCCATA 
31601 CCCAGCCCTC CTGTCAGCTG TTCACCTTCC AGTTCTTCAG AGACATGTGT 
31651 GGCAGTGACT TTGGCCACAT AGCTGGCTGT GCCCTTTAAA GGCATTCCTT 
31701 GACACAGATA TGTGGACTGG TGACGTTGCT CTCCAGCCAG GTGTTCTTCC 
31751 CAGCAGGCTG GCCTGGCTGT CTCCTGCATG CCTGTACTTG TTTGTCTCCC 
31801 TGCTCCCTCT CCTGGGCCTG GCCAGAGCTA CTTGCAGCAA ACAAAAGCAG 
31851 GATATTGGCA ATGGAAAGGA GGGTGTGTTC TGGTGCTCCC ATGCCCTGCG 
31901 GCGCACATAC CATTGCAAGG GCGTAACAGA GCCCAGGCCT GCATTTGGGT 
31951 GCAAATAAGT CTGCACACAG AAGAAAAGAA GGACCTGGTG ACCAGGAGCC 
32001 ATGGAACCCT TGTGCTCCCC TACCTGGGCT ACTGGTTCTT GCCACTCCTA 
32051 CCATTTTCAG TTTGGAAATA TTTGTTAAGG CTTTGCTCTT CCAGGTCCTT 
32101 TGCTTGGTGC TGAGTCTACC AAGAGTAAGT GGGATGCTGT TTTTGTCCTC 
32151 AGGGAGCTAA CAGTCTAGTG AAGAAGAAAG ATGGTTGCCC AGGAACTTCT 
32201 AAGTCAGAAG GCAGGAGGCA AGAAGGAAGC CCCTGCTCCT ACTGCCAGCC 
32251 CTCTGTTGGG CACCCCATAG TTCTTCAGAA CCACATTTAA TCCTCACTGC 
32301 AGGCCAGGCA TAGTGGCTCA CACCTGTAAT CGCAGCACTT CGGGAGGCCA 
32351 AGGCGGGCAG ATCACTTGAG GTCGGGAGTT CGAGACCAGC CTCACCAACA 
32401 TGGGGAAACC CCGTCTCTAC TAAAAATAGA AAAATTAGCC GGGTGTGGTG 
32451 GCATGCGCCA GTAATCCCAG CTACTCAGGA GGCTGAGGTG GGAAAATCAC 
32501 TTGAACTCGG GAAGCAGAGG TTGCAGTGAG CCGAGATTGT GCCACTGCAC 
32551 TCCAGCCTGG GCGATAAGAG CAAAATTCCA TCTCAAAAAA AAAAAGAAAA 
32601 AAGAAAAAAT CCTCACTGCT ACCTTGAAAG TAGGTGATGA CATTGCCATT 
32651 TCACAAATGA GAAGTGAAGG GGCTAGCCCA AGATCACTTA GGTGGTAAAT 
32701 GGTGGTGCTA AGATTAGAAC CTCAGATCAT CTAGGGAAAA ACACAGATAT 
32751 GCACAGAGTT AAGGGGACCC AGGGTATTGT TTGTCCTCTT GTTTCACAGG 
32801 TGGGGAAACA ACCCAGAGAG GGAAAGGGGC TTGTCCAAGG CAATTTAGCA 
32851 CCCAAGAACT TGAACCCATA TCTCTCTCCT CCTCATTTAG AGCTCATCCC 
32901 ACATGTATCT TATATTGAGA GGAGTGTGAG CCACATACCA AGAACAGTCT 
32951 TCCCCTCTGC CTCCAACCTC ACTGTGCAGT TTTGAGACAC TTCACAGCCA 
33001 TACTCTTCAT GCCATACCCA GCCCTTAAGA CCCTGAAGTT CCCCTTCCAT 
33051 AAGACAAGTA GGAAAAGCTA TAGGGTAAAA ATAGCCATCA GTGTTTGTTG 
33101 AGCACCCAGG AGGAATTGGG CACTCCAGAA AGATAAAGGG ATTCTCAGGG 
33151 ACTTGCTTCT CTAGACTTCC CTAGCTCAGC TGCTTCAACT CATTCCTGCC 
33201 CCTCTTCTCT ACCTCCCGCA GTGCTCAGAA GTAGTAGAAC TCACTGTGGC 
33251 CTCTCACCTT GCATTGTTGA GTTTTATTTA GACTTTCTCT TCCTCAACTC 
33301 TTCATAAGCT CATGAAAGGT GAAGTAGGGT GCCCTGTGTA TTTATCTTTT 
33351 ATATCTGCAG TGCTTAGCAA GTTATAATAA TGCACTTGCC TGGCAAAAGG 
33401 CTTTCTCTCA TACATTAGCT TATTTCCTCT TCACATTGGC TCTTTGTAGT 
33451 AATAGGATGC TATTAGTTAT TTTCAATGAG AGAAAGCTAC TAAGAGAAGT 
33501 TGTCCAGCTA GTGACAGTAA GTGGCTGATA AAGTGAGCTG CCATTACATT 
33551 GTCATCATCT TTAATAGAAG TTAACACATA CTGAGTTTCT ACTATATTGG 
33601 GTCI II I II I l l l l ll ll l l I Mill IMA GAGACGGAAT CTTGCTCTGT 
33651 TGTCCAGGCT GGAACGCAGT GGTGCAATTT TGGGTCACCA CAACCTCCGC 
33701 TTCCCAGGTT CAAGCGATTC TCCTGCCTCA GCCTCCTGAG TAGCTGGGAC 
33751 TACCAGTGCA CGCCACCACG CCCGGCTAAT TTTTGTATTT TTAGTAGAGA 
33801 CAGGGTTTCA CCATGTTGGC CAGGCTGGTC TTGAACTCCT GACCTTGTGA 
33851 TCTGCCCGCC TCAGCCTCCC AAAGTGCTGG GATTACAGGT GTGAGCCACC 
33901 GCGCCCTGCC TATATTAGGA CTTTTATATA AGCTATCTCT AGCTAGCTAG 
33951 CTAGCTAGCT ATAATGTTTT TTGAGACAGA GTCTGACTCT GTCACCCAGG 
34001 CTGGAGTGCA GTGGCGTGAT CTCGACTCAC TGCAACCTCC ACCTCCTGGG 
34051 TTCCAGTGAT TCTCCTGCCT CAGCCTCCCG AGTAGCTGGG ATTATAGGTG 
34101 CATGCCACCA CGCCCAGCTA ATTTTTTGTA TTTTTAGTAG ACCAGGTTTC 
34151 ACCATGTTGG CCAGGCTGGT CTCGAACTCC TGACTTCAAG TGATCCACCC 
34201 GCCTCGGCCT CCCAAAGTGC TGGGATTATA AGCATAAGCC ACTGTGCCCA 
34251 GCTGCTCTCT ATATTTTTAA TACATATTAT TTCCATTAAT TTTCACAGCA 
34301 GTTCATTTTA TAGATGAGGA AACTAGGCCA GAGAAGTAAA ATATCTTGCC 
34351 CAAGATGATG TAACTAGTAA GTGGCAGGAT CAAGATTCAA ACCAAGCAAT 
34401 GTTCAAACCT CTTGGAAGCA AGAATGTGGC CACTGTGGAA GGTGCAAGGC 
34451 CTTGACAACA AGAATAGGGA AAAGAAGGAA CTAGAAGGAA AGAGATGGCA 
34501 TGGGCTCAGC AGGCCAGGGA GCTCTTAGCT GTGTGTGTTG GGAAGCTCAG 
34551 AAGGGAGGAA GAGGTTGTCT GTGCAGGTAA GTCCTGAGAA CACACCAGAC 
34601 TTTTGAGAGG TGGAGCTTCA TAGCCAGGTC ATTAGGGGAG AAGGGAGCTA 
34651 TAG AI 1 1 1 1 1 llllll l ll l llllllllll 1 1 1 1 1 1 1 IAG AGACGGGGTC 
34701 TTACTATGTT GCCCAGGCTG GTCTTGAACT CCTGGGCTCA AGTGATCCTC 
34751 CCACCTCAGC CTCCCAAAGT GCTGGGATTA GAGGCATCAG CCACCCCGCC 
34801 CAGCGAGCTA TGGATCTAAC ATGTACATCT TACACAGTGC TAATAGAATG 
34851 TTGGGTTTCT TCCCCAATAT TTTATTTTGA AAAAAAATTC AAATATATAG 
34901 AAAAGTTGAA AAATGTAGTT CAAAGAACAC CTACATACCT TTCACATAGA 
34951 TTCATGATTT GTTAATGTTA TGCCACTTTG TATATATCTC TCTCCCTCCT 
35001 ATCTGTATAC TTTTATTTAT TTATTTTTGC TGAACTATTT CAGAGTAACT 
35051 TAAAGGCATC TTGATTTTAC CCTTGAACAG TTCAATATGT TTCTGCTAAG 
35101 AATTCTCCTA TATAAGTCAG ATATCATTAC ATCTAAGAAA ATTCACGGCA 
35151 ATTTTACAAT ATAATATTAT AGTCCAAATC CATATTTCCT CAGTTGTTCC 
35201 AAAAAATGTT CATGGCTGTT TCCTTTTTTA ATCTAAATTT GAATCCAAGT 
35251 TTGAGGCATT GTATTTGGTT GCTGTGTCTC TAGGGTTTTT AAAATCTGTG 
35301 CCTTTTCTTC TCCCCATGAC TTTTTAGAAG AGTCAAGACC GGTTATTCTT 
35351 ATAGAATAAC CCACATTCTA GATTTGCCTG ATTAGTTTTT TTATACTTAA 
35401 CGTATTTTTG GCAAGAACAT TACATTGGTA ACGCTGTTGG TGATGGGTCA 
35451 GTTTTGAAGA GTGGAGATGA TTAAACTGCT TTTGTTCATT GAAGTATCTG 
35501 TCAAGACCAG AGATCCTTAA CTGGTGCCAT AAATAGGTTT CAGAGAATCC 
35551 TTTATATATA CACCCTGTCC CCCACCTAAA TTATATACAC ATCTTCTTTA 
35601 TATATTCATT TTTCTAGGGG AGGCTTCTTG GCTTTTATCA AATTCTCAGA 
35651 GGGCCCCAAG ACCCAAAGAG GTTATGAAAC ACTAGTCTGT CCACTGAGGC 
35701 AGGCAACACA GAGCTGGTTT CTGGGGCCTT GTTCAGTCTG AACCAGCTTC 
35751 CCTTGGGGAG ATAGCACAAG GCTGTAACTT TGCCCCATCT TGGCTTTGGA 
35801 TCAAAGAGGA CTGTCCATTT TGTTGTCATA CCTAGGAACC AGGGACAGCT 
35851 TATGTGGCCT GGTTCCAGGG ATCCAGGAGA ATTTCAGTTC TTGTCTTGCC 
35901 TTTCAGGTGT TCAGAATGCC AGGATTCCCT CACCAACTGG TACTATGAGA 
35951 AGGATGGGAA GCTCTACTGC CCCAAGGACT ACTGGGGGAA GTTTGGGGAG 
36001 TTCTGTCATG GGTGCTCCCT GCTGATGACA GGGCCTTTTA TGGTGAGTGA 
36051 ATCCCTTCAT ATCTGCCCCT CTTGGTCTTC AGAGTCCATT GACAGTGCTT 
36101 CCAGTTCCCT GTGGCCTGTT AATCTTTTAG TCTTTCCATC AGCCAGGGCA 
36151 TCTCCCTTTA TTTATTCATT CATTCAACTA GCAGGTATCA ATTGAGCACC 
36201 TACTAAGTGA AAGGTAAGAT CCTTCCCTCA AAGACTTAAT AGTTGAACGT 
36251 TGGGAGTGGG AGGAGAGGCA GGCAGAGAGG AGACACAATA TAGTTGGATA 
36301 AGGACCTCCA AGGAGAGTGT TACAGGCTGA GAGGAGGATA TACTTAGGTT 
36351 GTCTTTAGGG AATCAGAAAA GGAGACTCTG GAATAGGCTG GCAGAGAGAG 
36401 GGGCTACCTC CTATACCTGC TCTGGACAAA CGACTTTAAG CATAGTGACA 
36451 GATTTGCCAA CCCTGTATTG GAAGAACTGA TCTTTTTTAG TGGGGATGAT 
36501 TACTTCTGGG GATTTCTTCT CATAACTGAG ACCAAAACAG TTTTGTGCAG 
36551 TCTCAGAAAT GACAGGAGGT ACCAATCTGA CACTTCCTTT GGAAGCTCTA 
36601 GGGCAGAGAG TGAAAGAGTG GATTTTGACG GGGGCCTTGC TTGGAGGTCA 
36651 TTCACCCACC CCTGTCCTCA CTCCAGCAAC AGTGATAACT CACTTCCTTC 
36701 CTCCCTTTGT ACACCCTTCT CCCCACCTGC TCACAGGTGG CTGGGGAGTT 
36751 CAAGTACCAC CCAGAGTGCT TTGCCTGTAT GAGCTGCAAG GTGATCATTG 
36801 AGGATGGGGA TGCATATGCA CTGGTGCAGC ATGCCACCCT CTACTGGTAA 
36851 GATAGTGGTC CTTTGTCTAT CCTCTCCCAT ATAAGAGTGG CTGGCGGGGA 
36901 GGGACAGTGG CAGGGTGAGT TGGGCAGAAG GAGTGTTAGG GTAGTCAGAG 
36951 CATTGGATTC TTACCACAGC AGTGCTCTTA ACCAGCTCTT TAACTTGTAA 
37001 GCAGAATGAT TTACACATGT CTCTACCCTT TTTCCTTACC AACCTTGAAA 
37051 ATGTCTTCAC TCTGCCCTGC AATCCTCCCA GTGGGAGGCA CTCTTCAAGG 
37101 ACGATCCCAG AACATTAAAG TCAAAGACCC CTTAGAGCTC ACCCTGTCCA 
37151 ACCACCTTGG TTGATAAAAG AAGTCAGCCT GGGGCCCATG GAATAGAATA 
37201 GTACAAGGGC AAGGTTCTCA TTGTGAGTCA AAGGTAGAGT GAAGAGAACC 
37251 CAGACCATCT CACCCCAACC CAGGCCAGTG TTTTTCCAAA TATACCACTT 
37301 GCTGCAGATC TAGCTCAGCA CCCCCAGTCC CAGCCCACCC TGAGAACCCA 
37351 GGCTCCTCAT TCTGAGCAGC CAGCTAGAAT CATGACAAAG AGGGTGGTAG 
37401 TGAGACTATG GGTACTGTTG CTTAAAGCCA CATGGTGCAG TGGTTGCTGG 
37451 GGGGCTTCTG TGTGGGACTC TAGCATCTTA TTCCCCCCTG TGCCCTCTCC 
37501 CCAGTGGGAA GTGCCACAAT GAGGTGGTGC TGGCACCCAT GTTTGAGAGA 
37551 CTCTCCACAG AGTCTGTTCA GGAGCAGCTG CCCTACTCTG TCACGCTCAT 
37601 CTCCATGCCG GCCACCACTG AAGGCAGGCG GGGCTTCTCC GTGTCCGTGG 
37651 AGAGTGCCTG CTCCAACTAC GCCACCACTG TGCAAGTGAA AGAGTAAGTA 
37701 TTTTGAGAAC CCTTCAGCAG GGGTTCTTGA GCAGAGTCTG TAAATGGGCC 
37751 TCAGAGGGCT TAGACCTCCA AAGTCTCATG CAGAACTCCC TTTATTCTCA 
37801 TCTCATATCT TTCTCCTGGA CCCCACTATG CTGTAACCGT ACCTGGGCCT 
37851 TGGCACTTAC TGTTCTCTCT GCCCAGGCTA CTTCCTACCC GATACTTAAG 
37901 GCAAGAATCA CTCACCTTTC AGGTGTCAGG TTTCAGGTCA TGTTTGCTCT 
37951 TTGAAATCAT CTGGCTTGAT TATGTGTATT AGTTGTTTAT CTTCTATCCC 
38001 CTCCACTAGA ATGTAAATTC CAGAAGAAAC TTGCTGTCTT ATTCAGTGCT 
38051 GCATGCCCAG GGCTTGGAAG AGTACCTGGC ATATAGTAGG AGTTGATTGA 
38101 TTATTATTTT GTCAGTCGAG AGAATGAATG GAGAAAATGT GGTCCATGGC 
38151 CCAAAAGAAG TTAAGACCCT ATCCTAGATT CAGGCCAGAG ACCAGATGGA 
38201 GAAAGAGTCT GTGTCTATCT AATACCAGTA ATGTCGTACC TCTGGCCGCT 
38251 TACCATGTAA ATATTGATTG TGTATCTACC ATGTGTTGGA CACTAGGCTA 
38301 GTGCTTGCAC AGCAGGTGAA AGATACTAGA GTTTGGGAAG TCAGGAGGAG 
38351 CTAAGGTCTG TTCTACAACC TTATTAGATG AAGAGGAGAG GGAATTGTGT 
38401 TCAGGGCAGA GGGAGAAGCA TTTCTCCAAA AGTAGGAGTC TTAATCATGT 
38451 CTGATGTAGG TTGAGTGTGG CCAGAAAAGG GGCTGTTAAG TATAGAGGGC 
38501 CTGGATTATG AAAATCCAGC AGATCCATTG AGAGTTTAAG CAGCAAGGTG 
38551 TTGTGACCAA GTTAACATTT TAGAAGGATC ACTGGTATGG AGGTTGGATT 
38601 GGAGAGGGGA AAGCCTAAAG GTATAGAGAC TAGTTAGGAA GCTATTGTAG 
38651 GCTGGGCATG GTGGTTCATG CCTGTAATCT CAGCACTTTG GGAGGCTGAG 
38701 GTGGGAGGAT TGCTTGAGGC CAGGAGTTGA AGACCAACCT GGCCAACATA 
38751 GCAAGACCCC GTCTCTGTTT TTCTTAATTA AAAGAAAAGT CCAGACGTAG 
38801 ACATAGTGGC TCACGCCTGT AATGCCAGCA CTTTGGGAGG CCAAGGTGGG 
38851 CAGATTGCTT GAGGTCAAGA GTTTGGGATT AGGCCAGGCG CAGTGGCTCA 
38901 CGCCTGTAAT CCCAGCACTT TGGGAGGCCG AGGTGGGCGG ATCACAAGGT 
38951 CAGGAGATCA AGACCATCCT GGCTAACACA ATGAAACCCC GTCTCTACTA 
39001 AAAGTACAAA AATTAGCCGG GCATGGTGGC GGACGCCTGT AGTCCCAGCT 
39051 ACTCGGGAGG CTGAGGCAGG AGAATGGCGT GAACCTAGGA GGCGGAGCTT 
39101 GCTGTGAGCA GAGATCACGC CACTGCACTC CAGCCTGAGC GACAGAGCGA 
39151 GACTCCATCT CAAAAAAAAA AAAGAGTTTG GGATTAGCCT GGCCAACATG 
39201 GCAAAACCCC ATCTCTACAA AAAGTACAAA AAAATTAGCT GGGTATGGTG 
39251 GTGCGCGCCT GTAATCCCAG TTACTCAGGA GGCTGAGGCA TGAGAATTGC 
39301 TTGAGCCTGG GAGGTGGAGG TTGCAGTGAG CCCAGATCAT GCCACTGCAC 
39351 TCCAGCCTGG ATGACAGAGT AAGATGCCAT CTCAAATAAA AATTAAAAAC 
39401 AAAGTTTAAA AAAAAAATAG AAGCTATTAC CGTGATCCAG GTAAGAGATG 
39451 TGAATAACTA CAATGATGGA AAGAAGGCAG AGTTCTTAGA GATGGGAGTA 
39501 GGAGAGATGA GGGAACTCCA GATTGGGAAG ATGATGTTCA AGTTTCTGGC 
39551 TTAGGCCACA GGGTGAGTGG CAATTCCCTT CACTGAGATG GGGCATCCTG 
39601 GAAAAGGTGT TGCCTTTCTG TGTGGGTATC CTGGGCCCCT TAGGGGCCAC 
39651 TGGTGGCCTG GGACCTGGTA AACCTTCCCT GCACAAGCAG AATTGGTCAA 
39701 GCAGGTTTTT AGGACATCTT TACCCTGCCT CAACTCTTGT CTGGCCCAGG 
39751 GTCAACCGGA TGCACATCAG TCCCAACAAT CGAAACGCCA TCCACCCTGG 
39801 GGACCGCATC CTGGAGATCA ATGGGACCCC CGTCCGCACA CTTCGAGTGG 
39851 AGGAGGTAGA GTGTGTGTCT AATCTGTCTT GTGAGGGTGG GACATGGAAC 
39901 AGATCCTCTG GGAAATCAGG CTGTAGCCTT TACCTTTTCC TACCCCCAGC 
39951 CCATCTCTTT GTCTTAGCAT TGAGCCTGTG ACCACTGGTG ACCTATTTCA 
40001 GCGTAACAGG TTCCCAGGGT AGCAGGGATG GTTGATGGAC GGGAGAGCTG 
40051 ACAGGATGCC AGGCAGAGGG CACTGTGAGG CCACTGGCAG CTAAAGGCCA 
40101 CCATTAGACA AGTTGAGCAC TGGCCACACT GTGCCTGAGT CATCTGGGTT 
40151 GGCCATGGGT GGCCTGGGAT GGGGCAGCCT GTGGGAGCTT TATACTGCTC 
40201 TTGGCCACAG GTGGAGGATG CAATTAGCCA GACGAGCCAG ACACTTCAGC 
40251 TGTTGATTGA ACATGACCCC GTCTCCCAAC GCCTGGACCA GCTGCGGCTG 
40301 GAGGCCCGGC TCGCTCCTCA CATGCAGAAT GCCGGACACC CCCACGCCCT 
40351 CAGCACCCTG GACACCAAGG AGAATCTGGA GGGGACACTG AGGAGACGTT 
40401 CCCTAAGGTG CCACCTCCCA CCCTGGCTCT GTTCTGTCCT ATGTCTGTCT 
40451 CTCGGATGAA GCTGAGCTGG CTTTCAGAAG CCTGCAGAGT TAGGAAAGGA 
40501 ACCAGCTGGC CAGGGACAGA CTATGAGGAT TGTGCTGACC CAGCTGCCCC 
40551 TGTGGGGATC ACAGTTTACA GCCAGAGCCT GTGCGGACCC AGCTGTCTGC 
40601 CAGGTTTCCT TAGAAACCTG AGAGTCAGTC TCTGTCCACT GAACTCCTAA 
40651 GCTGGACAGG AGGCAGTGAT GCTAAACCCT GAAGGGCAAC ATGGCCTATG 
40701 GAGAAAGCAT GGAGCTCAGA GCCTGGAGTA CGGGCACAGA TAGGATTGAA 
40751 TAAATTGTGT AGAAAGACTT TGAAAACAAT AAAGCAAAAG ATGAATGAAC 
40801 GIN II MIA GACTTGAGGG ACCAACAACC CCCAAACCCC AGATTCTGCC 
40851 AGGTCCATGG GGAAGGAGAA GTTGCCTTGA GTGGAAGCCC CAAGTAGGGA 
40901 GACTTACAGA AAAGAAGTCA AGAGCACTGG CTCCCAGGCA GAAATACTGA 
40951 TACCCTACTG GGGCTTCAGG CTGAGCTCCT CCCTTCACAA AJCACTTCAT 
41001 CTCTCTGAGC CTGTTTCTGC ATCTGTGACA TAAGATGGTA AGATAAAGGT 
41051 GGCTGTCTCA CCAATTATGT AAGGATTAAA TGTGGAAAAG GACATAAAGT 
41101 TGTATAGTGC TGCCATAGGG ACAGTGTTCA GTAAACGTGA CACATTCTTA 
41151 GTATCACTAA GAATCAGGTT CTTGGCCAGG CACCGTGGCT CATGCCTGTA 
41201 ATCCCAACAG TCTGGGAGGC CTAGGTCGGA GGATGGCTTG AACACAGGAG 
41251 TTTGAGACCA GCCTGAGCAA CATAGTGAGA CACTGTCTCT ACAAAAAAAA 
41301 AATAATAATA ATAATTGTTT TTAATTAGAT GGGCAGGGCA CTGTGGCTCA 
41351 CACCTGTAAT CCCAGCACTT TGGGAGGCCA AGGCCGGAGG ATTGCTTGAG 
41401 GCCAGGAGTT CAGGAGCAGC CTGGGCCACA TTCCTGTCTC TACAAAGAAT 
41451 AAAAAAGTTA ACTGGGCATG GTGGCACATG CCTGTAATCC CAGCTACTCA 
41501 AGAGGCTGAG GAGGAGGATT GCCTGAGCCC AGGAGTTCAA GACTGCAGTG 
41551 AGCCTTGATC ACACCACTGT ACTACAGCTT GGGCAACAGA GTGAGACCTT 
41601 GTCTCCAAAA AAAAAAGTTT GTTTTTTTTT ATCCACTCTC CTCACCAAAC 
41651 AAACTGAGTA AGTTAGAGCC CTCTCAGCTG GCATGTGTTG GAAACAGTGC 
41701 CCTCTCATTA AAGTGCTGCC CTCACTCCCA TTGCCTCTTG GCCTTGGTCA 
41751 GTATGATGAA ATTAGTGGGA GGCAGGGCAA CAGAGGGCAG GGAAGAGCTA 
41801 GAAATCCATG GCCTGGAAAA GGGAAGATTT GGGAGTGGCC AGGTATCTGT 
41851 AGAGCCACCA TGCAGAGGAG GGGGGCAGCT AGCCTTGTGT GCTCTGGTGG 
41901 GCATGGTCAG CAGGAGGCAG AGCAAAAGGA CAAGGGTAAG TAAACCTGTA 
41951 GGTCGGGACA AGCCAAGAGC CATCCAGCGT CAGTCCTCTC TGGGTAGCCC 
42001 AAGTAAAGCA GGAGCATACC CCAGAGAGAA AGTTCGCAGG GCTGTTCACC 
42051 TGCAGTGCTG TGGACTTCAA CCTTCTTGTT CCTTCTTCAG TAAGTGAAAA 
42101 TAACAGTCAT TGACCATGAC TATTATCGAC CGCTTTTGAA AATGTAAACA 
42151 TAGTGACTTT ATTGCTGTAA AAATCATACG TGTTTATCAT CTTAAAATTC 
42201 AGGAAACATG GACAGGTACA AAGATGTGCA AAATATCATC CAAAATCCCA 
42251 TTTGCTGGCC AGGCACGGTG GCTCACGCCT GTAATCCCAG CACATTGGGA 
42301 GGCCGAGGCG GGCAAATCAC TTGAGGTCAG GAGTTTGAGA CCAGCCTGGC
42351 CAACATGGTG AAACCCTATC TCTACTAAAA ATACAATAAT TAGGCTGGGC 
42401 GCAGTGGCTC ACGCCTATAA TCCCAGCACT TTGGGAGGCC GAGGTGGGCG 
42451 AATCACAAGG TCAGGAGTTT GAGACTAGCC TGGCCAATAT GGTGAAACCC 
42501 CATCTCTACT AAAAATACAA AAATTAGGGC CGGGTGTGGT GGCTCACGCC

42551 TGTAATCCCA GCACTTAGGG AGGCCGAGAC AGATGGATCG CGAGATCAGG 

42601 AGTTCGAGAC CAACCTAGCC AACATGGTGA AACCCCATCT CTACTAAAAA
42651 AATACAAAAA TTATTCGGTT GTGGTGGCAC ACGCCTGTAA TCCCAGCTAC
42701 TTGGGAGGCT GAGGCAGGAG AATCTCTTGA ACCTGGGAGG CAGAGGTTGC 
42751 AGTGAGTGGA GATCCCGCCG TTGCACTCCA GCCTGGGCGA CAGAGTGAGA 
42801 CTCCATCAAA AAAAAAAAAA AAAAAAAAAA AAATTAGCCG GGCGTGGTGG 
42851 CGTGCACCTA TACTCCCAGC TACTTGGGAG GCTGAGGCAG GAGAATCGCT 
42901 TGAACCTGGA AGGCGGAGGT CGCAGTGAGC CGAGATCGTG CCATTGCACT 
42951 TCAGCCTGGG CGACAGAGCG AGACTCTGTC TCAAAAATAA TAATAATAAC 
43001 AATAACTAGC CGGGCCTGGT GGCACATGCC TGTAGTCCCA GTTACTCAGG 
43051 AGGCGGAGGC ATGAGACTCA GGTGAACTAG GGAGACAGAG GTTGCAGTGA 
43101 GCCAAGATCA CACCACTGCA CTCCAGCCTG GTTGACAGAG CGAGACTCTG
43151 TCTCAAAAAA AAAAAAATCC CATTTGCTCA TTTTTTGGAT ACTAGTATAA 
43201 CTATCACTCT AAACCAGTTA GTACTTAAAT CAAGCAGATA TGGGAGATGG 
43251 TGAATTACCA TCTACAGTGT TGTCATATAT GTCACATACT GAGCATTATC 
43301 AGCTAGTAGA ATCTAGTTAA TTGTTCTATG TGTGATGTAT GCAGAGTTCC 
43351 CATTTTGAAT GTGTTTTTAC TATGCTTAAA TAAATGACTG ATGTCAGCAA 
43401 CCCCAAAATG ATACATCTGA TGTAAGAGCC CCTGTTCCCC AATAATAACA 
43451 TCTAAACTAT AGACATTGGA ATGAACAGGT GCCCCTAAGT TTCCTCCCTC 
43501 CAGGGTTTCT TGGCCGGTCT CTGAGGACTA CACATCCCTA CTCCCGTCTT 
43551 TCCTCATCTT CAGGCGCAGT AACAGTATCT CCAAGTCCCC TGGCCCCAGC 
43601 TCCCCAAAGG AGCCCCTGCT GTTCAGCCGT GACATCAGCC GCTCAGAATC 
43651 CCTTCGTTGT TCCAGCAGCT ATTCACAGCA GATCTTCCGG CCCTGTGACC 
43701 TAATCCATGG GGAGGTCCTG GGGAAGGGCT TCTTTGGGCA GGCTATCAAG 
43751 GTGAGCGCAG GCAACAATTG CTTTGCTCTT CTGCCCCCAG TCCCTCTGTC 
43801 ACTGTCTTTC GGGGATTTCT CATCACTTGG CCCCACCCCA CACCATGCAG 
43851 GATGCCAGGC CTCCTTCCTG GCTTTGGGTG TTGGTGTGAG AGGTATCCTT 
43901 CACCCCCACC CAGGCCACCT AAGGTCAATG TTGCTGTTAC AGTGAGCTTG 
43951 TGGACCTGGA GATCCAGGTT GGGTTGAGCT GTGCCTGTGG CCCTCCTGCC 
44001 TCCAGTCAGT GGGTGTTTGT TAGGTGCCTG CAGACCTCAG TACCGGGCAT 
44051 GCTACAAGGA GCACACAGGG GAATGGCTCC TGCCTCCCTG GTGAACAGTC 
44101 TCAGGGACTA ACCTCTCTCT TTCTCTCCTC CTCCTCCTCT TCTGCTGAGA 
44151 ACTGGGAGGG GGGGTCAGGT AAGACGTGTG TCTCAGCTTG GGGGCAGCAG 
44201 GGCTGGAGAG CTCACCCCCG ATCCACCCAG CTCCCTGGTG CATGTCTTTG 
44251 GCACTGACCT TCCTGCCCCC AGACTTCTGT TCACTCAGGA GACTCACTTC 
44301 TATGCCAAAT GACCAGAGCC CCTGCTTGGC TTGGCAGCAT CCCCTCCTGC 
44351 CTTCTTCCCC ACTTCCCTTT TCTGGGTTCT TGCCTGTCCT CTGTGCATGC 
44401 CCAGCTCTCC AGGAAAGAGG GTTTGCTTCC GTGTGAGTCC CATGTTGCTC 
44451 CACGCTGCAT CTTCCACACA TGAACTCTGT CATTCTGACC CGGCTCAGTG 
44501 TGCCCTCCAA GGGATGGGAT GGCCAGCTGC ATAGATTTTC TCAAACAGTT 
44551 CTCCAGAACT TCCTCTGGTC TCAGCACCAT TAACAGTCAC CCTCCCTGTA 
44601 GGTGACACAC AAAGCCACGG GCAAAGTGAT GGTCATGAAA GAGTTAATTC 
44651 GATGTGATGA GGAGACCCAG AAAACTTTTC TGACTGAGGT AAGAAGATGG 
44701 AGGGGGCCCG GGAGGTTGGT GTCACCATTG GAAGAGAGAA GACCTTACAA 
44751 ATAATGGCTT CAAGAGAAAA TACAGTTTGG AATTACTGTC TTAAAGACTA 
44801 AGCAGAAAAG AGCCCTAGAG GAATATCCCA CTCCCTCTAA ATTACAGCGT 
44851 AATTATTTGT TCAATGAACA CTTACTAAAA GCAACACAAA CAGGGTACAA 
44901 GGGATGCAGT AACAAAAGAT ACAGGGTTCA GAAGAGCTCT CAGGTTATGA 
44951 GGATGATGGA CATGAAAACA CTCCAATTTA GTACAACTCA ATGTTATAAT 
CCTCACCTGA ACGCCCTGCT AAGGGAGCCT GGAGGGGAGC TCCCTGAGCA
CTCACACTCC TTGGGCATTT ACAGTTTTCA CTACCCCTCC CAAGTTACTT 
CATGGAGTAA CTTAAGTTGG GGACACCTGT GGTCTGGGTA TTGCCCTCCA 
AGCCACTTGG CCACTCCCAC CCCAGTTCTC CCAATGCAGT TCCAAGGGTA 
AGGCCTATGA AGCCATCTCC ATCTATATGG TGGTGGTCTT CCCTCATCCT 
GATCTTAGTG CCCTGTCATA TCACAAGATA GGAGGTAGGA GATACAGGTG 
GTAACACTTG TCAAGCTGAT TCCTTGGAGG GAAGAGGTAA GGAAGACAGT
GAGAAGTTAA CCACCAGCTT TCCTTGGCTT CCCCCACCCC CAGGTGAAAG 
TGATGCGCAG CCTGGACCAC CCCAATGTGC TCAAGTTCAT TGGTGTGCTG 
TACAAGGATA AGAAGCTGAA CCTGCTGACA GAGTACATTG AGGGGGGCAC 
ACTGAAGGAC TTTCTGCGCA GTATGGTGAG CACACCACCC CATAGTCTCC 
AGGAGCCTTG GTGGGTTGTC AGACACCTAT GCTATCACTA CCCTAGGAGC 
TTAAAGGGCA GAGGGGCCCT GCTTTGCCTC CAAAGGACCA TGCTGGGTGG 
GACTGAGCAT ACATAGGGAG GCTTCACTGG GAGACCACAT TGACCCATGG 
GGCCTGGACC ACGAGTGGGA CAGGGCTCAA CAGCCTCTGA AAATCATTCC 
CCATTCTGCA GGATCCGTTC CCCTGGCAGC AGAAGGTCAG GTTTGCCAAA
GGAATCGCCT CCGGAATGGT GAGTCCCACC AACAAACCTG CCAGCAGGGC 
GAGAGTAGGG AGAGGTGTGA GAATTGTGGG CTTCACTGGA AGGTAGAGAC 
CCCTTCCTAT GCAACTTGTG TGGGCTGGGT CAGCAGCTAT TCATTGAGTT 
TGTCTGTGTC ACTGAAACTG ACCCCAGCCA ACTGTTCTCA GTTCACAGCC 
CTGTTTTCAA AGAATTACAC ATCTCTAAAG GCAAACAGGG CACGGACAAG 
GCAAACTGGA GAGGCAAACT GTAGCCTGAG ATGGCCTGGG CTTGCCATCA 
CAGGTATTCA GGTGCTGAGG GCCCTTAGAC CAACTAGAGC ACCTCACTGC 
CTAGGAAATC AATGAAGGGG AAATGAGTTC TAGCGGAGCC CTGAAGGATC 
AGAATTGGAT AAAGTTCTTA TTGGCAGAGA GGCACCAGGA TTGAAGTGAC 
AGGAGCAAAG ACCTGGGAGG AAAGAGGAGA AAATCATCTA TTTCACCTGG 
AAACAAATGA TTCCAAGCAT AGAAATAATA ACAGCTGACA AGTACTGAGT 
GCCCTCTATA TGCTAGGCAC TGGGCTGAGG GATTAACATG CATGTGCATG 
TTTATTCCTC ATGACAACCT TGGTTTCCAG ATAAGCTGGA CTGGAAAGGG 
ACAGAGCTGG GATCCTGGGC TAATCAGTCT GGTCGCCAAG CCTGAGACTT 
TAGCCACTGC CCTTCACATG GGGGTCCATG AAAATAGTAG TAGTCTGGAA 
CAGTTTGGGG GTACATCAAG GTCGCTGTGT TTTAAGCTAT GGAGTCTGGA
CTATAGGAGA CAAATGTAAA AGAGTTTTTT GGTTGACTGG CTTTTTGGTT 
TTTTTGTTTG TTTGTTTGTT TGTTTGTTTG TTTGTTTGTT TTTTCCTGTT 
TCTGGGGCTT GAATCAGGAA GGAGGTTTTT TTGTTGTTGT TGTTTTGAGA 
AAGGATATTG CTCTGTTGCC CAGACTGGAG TGCAGTGGCA CGATCATGGC 
TCACTACAGC TTCGACCTCC TGGGCTCAAG CAATCCTCCT GCCTTAGCCT
CCCAAGTAGC TGGACTACAG GTGTGTACCA CCACACCTAA TTTTTTGAAT 
TTTTTTTTCT TTTTTTTTTT TTTTTTTTTT GGTAGAGACA GGTTCTCACT
TTGTTGCCCA GGCCTGAATC TCAAACTCCT GGGCTCAAGC ATTCCTCCTG 
CCTCGCCCTC CCAAAGTGTT GGGATTACAG TTGTGAGCCA CCATGCCCGG
CAGGAAAAGA TTTTTAAGCA AGAAAGCTTA AGAGCTGTGG TTTTTCCAAA 
ATGAGTCTGG GCTGGCACAG TGGCTCATGC CTGTAATCCC AGCACTTTTT 
TGGGAGGCCG AGGTGAGTGG ATCACTTGAG GTCAGGAGTT TGAGACCAGC 
CTGGCCAACT GGTGAAACCC CTGTTTCTAC TAAAGAAAAA AATGCAAAAA 
TTAGCTGGGC GTGGTGGTGC ACGCCTGTAG TCCCAGCTAC TCAGGAGGCC 
GAGGCAGGAG AATAGCTTGA ACCTGGGAGG CAGAAGTTGC AGTGAGCCAA 
GATCACACCA CTGCATTCCA GCCTGGGTGA CAGAGTGAGA CTTCATCTCA 
AAAAAAAAAA AAAAGAGAGA CTGATATGGT TAGTACATTG GGGTGGAATG 
CGGAGGGTCC AGGGAATGGA GCCCTGCATA GGGGGCTAAT GAAACATTTC 



FIG. 3-19



U.S. Patent Jan. 22, 2002 Sheet 25 of 41 US 6,340,583 B1



47501 AGATTTCTGA ATTAAGGTAG TGGCTGTGGG GACAGGAGCC TGGGAGGCAG 
47551 GGTGGAGTCA GAATGGAGAG ACTGGTTGGC AATGAGGGAA CAGGAGGAGG
47601 AGGAGGAGGA GTTACGAGTG GCTTGAGGTG TCACTTACCA GACATTTGGG 
47651 GGATGGGGGA TAGCCGTGAT TGTTGAGCAA CTGGTTTGGG AAGAGCTAGC 
47701 ATTGATCCCT GCTGTTCTGT GCTAGCAGAA CCTATCAGCA TCTTCTGGGC 
47751 AGGAAACTGG CTCCATGAGA CTGGCTTAGG GAGAGGCTGC TAGTCACCTA 
47801 ATCTGCAGAG AAGGGGCAGC TGGAGCTGTG GGACAGAAGA GGCATCCATG 
47851 TAGCTGGTGG GGGTGTCTCA GCTTGTGAAG AGGAGATGGC TTTGAGCAGG 
47901 GCTGACACTG AAAAGGCTGG AAGAAAAAAA CAGACACACA AGAGTCTCAG 
47951 GATCAGGTAG CATAGGAAAG TTGTGGACAG TCTTTGAGGA GCACTCCCTC 
48001 AGGCAGGCAG GCAGGCAGGT CATGAGCTAT AGCGATTCAG GAAGAGCTCC 
48051 CTGGGTGTGT GAGCAGCTCC AGGAGCCTAA GGGATGAAAG TAGTATTGCA 
48101 GGGGGCTGGA GAGCAAGGAG TGGCTCCTTC TACATTTGCA AGGGAAGGAG 
48151 AAAGGAAGTT GCTCCTGAGA GTGGTAAGAG TCAGTGGTGG AGGCCTGGAG 
48201 AGGAGACATA ACAAACAAAT TTGTTGACAA ACATTTTGGT AGGAAGGGGG 
48251 AGAGCTTAAA GTTTAGACAG TGGGGAAGGT GGAGTCTTAG AGGAGGTGAA
48301 TGTCTGAAAG ACAGAGCTAG CTGGAGCAAG AAGTCACTTC TCTGTTGCAG 
48351 GCAGGAAGGA TCCAAAGTGG CTCAAGCCAG AGATTGGGAG AGTGGGGAGG 
48401 AGGGAGCAGC CTGGATCTAA GTAAAATGGG TAGAGGTGGA GGGGGTGCTG 
48451 CAACGGCCAG GGTTTTCTGA AGTTGGGGAC ATTAGGAGAG AGCTGTGAGG 
48501 GCTTTGGCCA GCCACTGTGC TAGTGATTGG TGAACCAAAG GATGGGCAGG 
48551 AGATGGCAGC AGGGAAGCAG AGGAAGTCCA GGCTTCCTGT TGGTATTGGG 
48601 ACAAGGGAGA GGCCATAGGA GGCCCTGGCC CTGTTGTCCA GGTTGGGTTC 
48651 TGAAGCTGGG TGGGCATGGC CTGGTAGGAG AGCATCTATG GCGCCCAATT 
48701 CCAGATTCAG GGTCTAGTTG ATTTGCTGGC CCTGTAGCCT CAGCTCATGC 
48751 TTCTGTTCCA GGCCTATTTG CACTCTATGT GCATCATCGA CCGGGATCTG 
48801 AACTCGCACA ACTGCCTCAT CAAGTTGGTA TGTCCCACTG CTCTGGGCCT 
48851 GGCCTCCAGG GTCCTATCCT TCCTGGCTTC CTTGTCACAA AGGAGGCTGA 
48901 CTTGTCCCCT CTGGCTAGAG GGCAGAGGTG TTGCCTAGGA GCTCCTATCT 

48951 TTCCCTTCCT GCTTCTTCCA ATGCCCTTCT CTGTCCTCTG GGAGCTCCGA 

49001 GACACACACA GACATAATTT CACCTTCTCT CATTAGCAAC CTTTGAAATA 

49051 ATTTGATTAG AAGGGACTTC AGAAGTTTGT TGACTATATG TAGAAAACCC 

49101 TGTCATTTTA CCTGCTTTTG CCCCATAGTA GTCTTGTAAA ACAGTTCATT 

49151 GCTGACCCCA TTTTACAGTG GTGGCACCTG AAGCCTCAGC CTGAGGCCAC

49201 CGAGCTAGTA AATTTACAGG GACCAGTTTG AGACCAGCAT TCCTCCCACT 
49251 GCCCCTCAGC TGTGGTGGTT ACAATGTTGT TTGTCTTACT GACTTGCTAT 
49301 CTGGCTTCCT GGGTGTCTAC CGGCTGGCCC TGGCTCTGCC CTCTAGACCC 
49351 ACACCACGCA ATCTTCATTC CTTTCCCACA TGACTGCCCT GTAGCTATTC 
49401 AAAGAGCTTG TCTCCCCCAA GTCTCCCCAT CTACTGGCTC CACCTTGCCT 
49451 TTTTCTGTCT TATCCTGGTT CTAGCCACTG CCTGAAATCA TTTTAGGAAT 
49501 AAGACAGGAC AGGGAAAAAC AAAAGCAACC CCCTGTCCCA CCTCTGAGTT
49551 CCACTCTCCA AGTCCCTGAG CCTCACCTCC AGGGCTCCAG TGGCTCTGCC 
49601 ATGAACCCAC TGTGGGCTGG GAGTCTGCTG TGCACAGATA CCAGACCCTC 
49651 AGAAACACAA ATGCCAAGTG TGTCTGTTTT TTTGTTTTGT TTTGTTTTGT 
49701 TTTTTAGATG GAGTCTCATT CTGTTTCCCA GGCTGGAGTG CAGTGGTGCA 
49751 ATCTTGGCTT ACTGCAGCCT CTACCTCCCG GGTTCTAGTG ATTGTTCTGC 
49801 TTCAGCCTCC CAGTAGCTAG GACTACAGGC GTGTGCCACC ACGCCCAGCT
49851 AATTTTTTTT TTTTTTTTTT TGTATTTTTA GTAGAGACAG GGTTTTGCCA
49901 TGTTGGCCAG GCTGGTCTTG AACTCCTGAC CTCAGGTGAT TCACCCGCCT 
49951 TGGCCTCCCA AAGTTCTGGG ATTACAGGTG GAAGCCACCG TGCCTGGCCT 
50001 GAGTGTGTCT ATTTGATAGA GCTTTCTGCT CTGATTCTCC CTTGCTATAC

50051 ACCTTTTCTC CCCTTCTCAG TGGCTTCTCT TGCCTATGCT TCCTCCCCAG

50101 GGCCAGGTTT GAGAACATCC CCATGAAGTC CTGACCTGTC TTTTATCCTA 
50151 CCAGGACAAG ACTGTGGTGG TGGCAGACTT TGGGCTGTCA CGGCTCATAG

50201 TGGAAGAGAG GAAAAGGGCC CCCATGGAGA AGGCCACCAC CAAGAAACGC 
50251 ACCTTGCGCA AGAACGACCG CAAGAAGCGC TACACGGTGG TGGGAAACCC 

50301 CTACTGGATG GCCCCTGAGA TGCTGAACGG TGAGTCCTGA AGCCCTGGAG 
50351 GGGACACCCG CAGAGGGAGG ACAGATGCTG CCCTTGCATC AGAGCCCTGG 
50401 GAATTCCAGG GGAGGCCTGT GAAGCGTAGG ACCGGATACC CAGAGCTGAG 
50451 GATATTTTTC CCTTGCCAGG TGGGGCCTCA CGATTTAGCT CCTGAGCTCA 
50501 GGGGGCTGGG AACTGATCAG TGTCCCATCA TGGGGGATAA GGTGAGTTCT 
50551 GACTGTGGCA TTTGTGCCTC AGGGATCGCT AAGAGCTCAG GCTATTGTCC 
50601 CAGCTTTAGC CTTCTCTCTC CATGGTGAGA ACTGAAGTGT GGTGCCCTCT 
50651 GGTGGATAAT GCTCAAACCA ACCAGAGATG CTGGTTGGGA TTCTTGAAAT 
50701 CAGGGTTGTG AGGCCTCAGA AATGGTCTGA ATACAATCCA TTTTGGAGTC 
50751 TGAGGCCCAG AGAAGTTCAG TGAATTGCCT AGGAGCATAC AGCTGCCTAA 
50801 TGGCAGAGGC TAGATGAACC CTAGTCTGGT TCTTTTCCAC TTTAACGTGC 
50851 AGTTTCATCC TAGGCAGTGT TATGTTATAA GGGCTCTCCA AGGCAGTTCA 
50901 CCTACGGCTG AGGAAGGACT ATTTTCAGGT GGTGTCTGCG CAGGACAGCC 
50951 TGTGGGGTGT CCCTACAGAA CCTGTTCTAG CCCTAGTTCT TAGCTGTGGC 
51001 TTAGATTGAC CCTAGACCCA GTGCAGAGCA GGTAAGGGAT GTAAACTTAA 
51051 CAGTGTGCTC TCCTGTGTTC CCCAAGGAAA GAGCTATGAT GAGACGGTGG 
51101 ATATCTTCTC CTTTGGGATC GTTCTCTGTG AGGTGAGCTC TGGCACCAAG 
51151 GCCATGCCCG AGGCAGCAGG CCTAGCAGCT CTGCCTTCCC TCGGAACTGG 
51201 GGCATCTCCT CCTAGGGATG ACTAGCTTGA CTAAAATCAA CATGGGTGTA 
51251 GGGTTTTATG GTTTATAACG CATCTGCACA TCTTTGCCAC GTTCGTGTTT
51301 CATTGGTCTT AAGAGAAGGA CTGGCAGGGT TTTTTTGTTT TAGATGGAGC
51351 CTCACTTCGT TGCCCAGGCT GGAGTGCAGT GGCACAATCT GGGCTCACTG 
51401 CAACCTCTGC CTTCTGGGTT CAAGTGATTC TCCTGCCTCA GCCTCCCAAG
51451 TAGCTGGGAC TACCGGCACA CACCACCATG CCCGGCTAAT TTTTGTATTT
51501 TTAGTAGAGA CAGGGTTTCA CCATGTTGGC CAGGCTGGTC TTGAACTCCG 
51551 GACCTCAGGT GATCCGCCTG CCTCAGCCTC TAAAAGTGCT GGAATTAATA 
51601 GGCGTGAGCT ACCTCGCCCG GCCAGGTTTT TTTTTTTTTT TTTTTAGTTG
51651 AGGAAACTGA GGCTTGGAAG AGGGCAGTGG CTTGCACATG GTCGATAAGG 
51701 GGCAGATGAG ACTCAGAATT CCAGAAGGAA GGGCAAGAGA CTGTTCATGT 
51751 GGCTGTCTAG CTAGCTCTTG GGCCAAATGT AGCCCTTCTC AGTTCCCTTC 
51801 AAGTAGAAGT AGCCACTCTA GGAAGTGTCA GCCCTGTGCC AGGTACCACG 
51851 TGGACAGAGT GAGGAATCTT GGAAAGATTC CTACCTTTAG GAGTTTAGTC 
51901 AGGTGACAGC ATATCTCAGC GACTCAAACA CACACACATT CAAAGCCTTC 
51951 TGTAATTCCT ACAAAGTTGT GAGGGGTAGA GGAGAGGAGA GACAAGGGAT 
52001 GGTTAGGATA ATGAAGGAAT GTTTTGTTTT TGTTTTTGTT TTTGAGATGG 
52051 AGTTTCACTC TGTCACCCAG GCTGGAGTGC AGAGGTGCAA TCTTGGCTCA 
52101 CTGCAGCCTC CGCCTCCCAG GTTCAAGCAA TCCTCCTGCC TCAGCCTCCC
52151 AAGTAGCTGG GACTACAGGT GTGCGCCACC ACGCCTGGCT AATTTTTGTA 
52201 TTTTCAGTAG AGACAGGGTT TCGCCATATT GGCCAGGCTG GTCTCAAATG 
52251 CCTGACCTCA GGTGATACAC CCGCTTCAGC CTCCCAAAGT GCTGAGATTA 
52301 CAGGCATGAG CTACCGTGCC TGGCCATGAA GGAAGATTTG TTTTAAAAAA 
52351 TTGTTTTCTT TAATATTAAT TGAACACCTC TGTTCAGAGC ACTGGGCTGG 
52401 TGCCAGAGGG TTTCAGACAT GAATCAGATC CAGCACCTCA TAGAGCCTTA 
52451 ATCTGGCACA CACACACAGC CACAAGGAGA CACAGACAAG GCAGGGTAGG 
ATGAGTGGAA GCTAGGAGCA GATGCTGATT TGGAACACTT GGCTTCTGCA
GTGAAGCCCC TTCTTAGTCC TCTTCAGTAA CCCAGCTCTC AGTGGATACA 
GGTCTGGATT AGTAAGATTT GGAGAGATGA TTGGGGATTG GGGAGAGCTC 
TCTAACCTAT TTTACCACCT CCTCTTCTGC CATTCTTCCT GTCCACATCC 
CCAGCATCCC TTTCCCTTGC GAAGTATCTG TGGCCTCTGT AGTCCTTTGT 
AAACAGCTGT CTTCTTACCC TACAGATCAT TGGGCAGGTG TATGCAGATC
CTGACTGCCT TCCCCGAACA CTGGACTTTG GCCTCAACGT GAAGCTTTTC 
TGGGAGAAGT TTGTTCCCAC AGATTGTCCC CCGGCCTTCT TCCCGCTGGC 
CGCCATCTGC TGCAGACTGG AGCCTGAGAG CAGGTTGGTA TCCTGCCTTT 
TTCTCCCAGC TCACAGGGTC CTGGGACGTT TGCCTCTGTC TAAGGCCACC 
CCTGAGCCCT CTGCAAGCAC AGGGGTGAGA GAAGCCTTGA GGTCAAGAAT 
GTGGCTGTCA ACCCCTGAGC CATCTGACAA CACATATGTA CAGGTTGGAG 
AAGAGAGAGG TAAAGACATA GCAGCAAGTA ATCTGGATAG GACACAGAAA 
CACAGCCATT AAAAGAAAGT TTAAAAGAAG GAAATTCACC CAAACCATTT 
GAATACAGTA AGTGTATTCA TCTTTCGATA TTCCCCTGTC CATATCTACA 
CATATACTTT TTTTTATAGT AAATAGTTCT GTATTTTGCC CTGCATTTCC
CTTGTGTTTA CTATCCAGTC TTCCTGTTTA TCATTTTTGT CGACAACATG 
AAATTCTATT GAGAGACTGT CTGAACATAT TGTAATGTAG ATGTTCAGGT 
TTTTCCAGTT TCTCTTTACA ATAGGTATTT AACTACAGTG AGCAGTTTTA 
TGCATTTAGC TAATTTCTCC TTTGAGGAAG TATTTTCAAA ATTACCTTTA 
TTCTTCTCAG GTAATAATTT CATTATTACC AAAGTTACCC TAGGTCTTTT 
CAAGTGTGTG GTTAAAAAAC GAGAATCTGG CTGGGCGCGA TGGCTCACAC 
CTGTAATCCC AGCACTTTGG GAGGCTGAGG CTGGTGGATC ACCTGAGGTC 
TGGAGTTCGA GACCAGCCTG GCCAACATGG TGAAACCCCA TCTCTACTAA 
AAATACAAAA CTTAGCCAGG CATGGTGGCA GGTGCCTGTA ACCCCAGCTA 
CTTGGGAGGC TGAGGCAGGA GAATTGCTTG AACCCAGGGG CGGAGGTTGC 
AGTGAGCCGA TATCACGCCA TTGCACTCCA GCCTCGGCAA CAAGAGTGAA 
ACTCTGTCTC AAAAATGGGG TTCTTTTCCT GCCATCAAAA ATCATGTTTC 
TTTTAAAAAC AAGTTCAAAC ATTACCAAAG TTTATAGCAC AGGAAATACG 
TCTTCTGTAA TCTCCCTTAA CCAATATATC CCTCAACATT CTCCTCACCC 
CCAACTCCAC CCTCCCAGGA TAACCAGTTG GGACATAATC TTTATTTAAA 
AATGGTTTCC GGATAGAGAA AGCGCTTCGG CGGCGGCAGC CCCGGCGGCG 
GCCGCAGGGG ACAAAGGGCG GGCGGATCGG CGGGGAGGGG GCGGGGCGCG 
ACCAGGCCAG GCCCGGGGGC TCCGCATGCT GCAGCTGCCT CTCGGGCGCC 
CCCGCCGCCG CCCTCGCCGC GGAGCCGGCG AGCTAACCTG AGCCAGCCGG 
CGGGCGTCAC GGAGGCGGCG GCACAAGGAG GGGCCCCACG CGCGCACGTG 
GCCCCGGAGG CCGCCGTGGC GGACAGCGGC ACCGCGGGGG GCGCGGCGTT 
GGCGGCCCCG GCCCCGGCCC CCAGGCCAGG CAGTGGCGGC CAAGGACCAC 
GCATCTACTT TCAGAGCCCC CCCCGGGGCC GCAGGAGAGG GCCCGGGCTG 
GGCGGATGAT GAGGGCCCAG TGAGGCGCCA AGGGAAGGTC ACCATCAAGT 
ATGACCCCAA GGAGCTACGG AAGCACCTCA ACCTAGAGGA GTGGATCCTG 
GAGCAGCTCA CGCGCCTCTA CGACTGCCAG GAAGAGGAGA TCTCAGAACT 
AGAGATTGAC GTGGATGAGC TCCTGGACAT GGAGAGTGAC GATGCCTGGG 
CTTCCAGGGT CAAGGAGCTG CTGGTTGACT GTTACAAACC CACAGAGGCC 
TTCATCTCTG GCCTGCTGGA CAAGATCCGG GCCATGCAGA AGCTGAGCAC 
ACCCCAGAAG AAGTGAGGGT CCCCGACCCA GGCGAACGGT GGCTCCCATA 
GGACAATCGC TACCCCCCGA CCTCGTAGCA ACAGCAATAC CGGGGGACCC 
TGCGGCCAGG CCTGGTTCCA TGAGCAGGGC TCCTCGTGCC CCTGGCCCAG
GGGTCTCTTC CCCTGCCCCC TCAGTTTTCC ACTTTTGGAT TTTTTTATTG
TTATTAAACT GATGGGACTT TGTGTTTTTA TATTGACTCT GCGGCACGGG
55001 CCCTTTAATA AAGCGAGGTA GGGTACGCCT TTGGTGCAGC TCAAAAAAAA
55051 AAAAAAAAAT GATTTCCAGC GGTCCACATT AGAGTTGAAA TTTTCTGGTG 
55101 GGAGAATCTA TACCTTGTTC CTTTATAGGC CAAGGACCGC AGTCCTTCAG 
55151 TAACACCAGT GTAAAAGCTT GAGGAGAAAT TGTGAAGCTA CACAGTATTT 
55201 GTTTTCTAAT ACCTCTTGTC ATTCTAAATA TCTTTAATTT ATTAAAAAAT 
55251 ATATATATAC AGTATTGAAT GCCTACTGTG TGCTAGGTAC AGTTCTAAAC 
55301 ACTTGGGTTA CAGCAGCGAA CAAAATAAAG GTGCTTACCC TCATAGAACA 
55351 TAGATTCTAG CATGGTATCT ACTGTATCAT ACAGTAGATA CAATAAGTAA 
55401 ACTATATTGA ATATTAGAAT GTGGCAGATG CTATGGAAAA AGAGTCAAGA 
55451 CAAGTAAAGA CGATTGTTCA GGGTACCAGT TGCAATTTTA AATATGGTCG
55501 TCAGAGCAGG CCTCACTGAG GTGACATGAC ATTTAAGCAT AAACATGGAG 
55551 GAGGAGGAGT AAGCCTGAGC TGTCTTAGGC TTCCGGGGCA GCCAAGCCAT 
55601 TTCCGTGGCA CTAGGAGCCT GGTGTTTCCG ATTCCACCTT TGATAACTGC
55651 ATTTTCTCTA AGATATGGGA GGGAAGTTTT TCTCCTATTG TTTTTAAGTA 
55701 TTAACTCCAG CTAGTCCAGC CTTGTTATAG TGTTACCTAA TCTTTATAGC 
55751 AAATATATGA GGTACCGGTA ACATTATGCC CATTTCTCAC AGAGGCACTA 
55801 CTAGGTGAAG GAGTTTGCCT GACGTTATAC AACCAGGAAG TAGCTGAGCC 
55851 TAGATCCCTT CCACCCACCC CATGGCCCTG CTCATGTTCC ACCTGCCTCT 
55901 AATTTACCTC TTTTCCTTCT AGACCAGCAT TCTCGAAATT GGAGGACTCC
55951 TTTGAGGCCC TCTCCCTGTA CCTGGGGGAG CTGGGCATCC CGCTGCCTGC 
56001 AGAGCTGGAG GAGTTGGACC ACACTGTGAG CATGCAGTAC GGCCTGACCC 
56051 GGGACTCACC TCCCTAGCCC TGGCCCAGCC CCCTGCAGGG GGGTGTTCTA 
56101 CAGCCAGCAT TGCCCCTCTG TGCCCCATTC CTGCTGTGAG CAGGGCCGTC 
56151 CGGGCTTCCT GTGGATTGGC GGAATGTTTA GAAGCAGAAC AAGCCATTCC 
56201 TATTACCTCC CCAGGAGGCA AGTGGGCGCA GCACCAGGGA AATGTATCTC 
56251 CACAGGTTCT GGGGCCTAGT TACTGTCTGT AAATCCAATA CTTGCCTGAA 

56301 AGCTGTGAAG AAGAAAAAAA CCCCTGGCCT TTGGGCCAGG AGGAATCTGT 

56351 TACTCGAATC CACCCAGGAA CTCCCTGGCA GTGGATTGTG GGAGGCTCTT 

56401 GCTTACACTA ATCAGCGTGA CCTGGACCTG CTGGGCAGGA TCCCAGGGTG 
56451 AACCTGCCTG TGAACTCTGA AGTCACTAGT CCAGCTGGGT GCAGGAGGAC 
56501 TTCAAGTGTG TGGACGAAAG AAAGACTGAf GGCTCAAAGG GTGTGAAAAA 
56551 GTCAGTGATG CTCCCCCTTT CTACTCCAGA TCCTGTCCTT CCTGGAGCAA 
56601 GGTTGAGGGA GTAGGTTTTG AAGAGTCCCT TAATATGTGG TGGAACAGGC 
56651 CAGGAGTTAG AGAAAGGGCT GGCTTCTGTT TACCTGaCA CTGGCTCTAG 
56701 CCAGCCCAGG GACCACATCA ATGTGAGAGG AAGCCTCCAC CTCATGTTTT 
56751 CAAACTTAAT ACTGGAGACT GGCTGAGAAC TTACGGACAA CATCCTTTCT 
56801 GTCTGAAACA AACAGTCACA AGCACAGGAA GAGGCTGGGG GACTAGAAAG 
56851 AGGCCCTGCC CTCTAGAAAG CTCAGATCTT GGCTTCTGTT ACTCATACTC 
56901 GGGTGGGCTC CTTAGTCAGA TGCCTAAAAC ATTTTGCCTA AAGCTCGATG 
56951 GGTTCTGGAG GACAGTGTGG CTTGTCACAG GCCTAGAGTC TGAGGGAGGG 
57001 GAGTGGGAGT CTCAGCAATC TCTTGGTCTT GGCTTCATGG CAACCACTGC 
57051 TCACCCTTCA ACATGCCTGG TTTAGGCAGC AGCTTGGGCT GGGAAGAGGT 
57101 GGTGGCAGAG TCTCAAAGCT GAGATGCTGA GAGAGATAGC TCCCTGAGCT 
57151 GGGCCATCTG ACTTCTACCT CCCATGTTTG CTCTCCCAAC TCATTAGCTC 
57201 CTGGGCAGCA TCCTCCTGAG CCACATGTGC AGGTACTGGA AAACCTCCAT 
57251 CTTGGCTCCC AGAGCTCTAG GAACTCTTCA TCACAACTAG ATTTGCCTCT 
57301 TCTAAGTGTC TATGAGCTTG CACCATATTT AATAAATTGG GAATGGGTTT 
57351 GGGGTATTAA TGCAATGTGT GGTGGTTGTA TTGGAGCAGG GGGAATTGAT 
57401 AAAGGAGAGT GGTTGCTGTT AATATTATCT TATCTATTGG GTGGTATGTG 
57451 AAATATTGTA CATAGACCTG ATGAGTTGTG GGACCAGATG TCATCTCTGG 
57501 TCAGAGTTTA CTTGCTATAT AGACTGTACT TATGTGTGAA GTTTGCAAGC

57551 TTGCTTTAGG GCTGAGCCCT GGACTCCCAG CAGCAGCACA GTTCAGCATT 

57601 GTGTGGCTGG TTGTTTCCTG GCTGTCCCCA GCAAGTGTAG GAGTGGTGGG

57651 CCTGAACTGG GCCATTGATC AGACTAAATA AATTAAGCAG TTAACATAAC 

57701 TGGCAATATG GAGAGTGAAA ACATGATTGG CTCAGGGACA TAAATGTAGA 

57751 GGGTCTGCTA GCCACCHCT GGCCTAGCCC ACACAAACTC CCCATAGCAG 
57801 AGAGTTTTCA TGCACCCAAG TCTAAAACCC TCAAGCAGAC ACCCATCTGC 
57851 TCTAGAGAAT ATGTACATCC CACCTGAGGC AGCCCCTTCC TTGCAGCAGG 
57901 TGTGACTGAC TATGACCTTT TCCTGGCCTG GCTCTCACAT GCCAGCTGAG
57951 TCATTCCTTA GGAGCCCTAC CCTTTCATCC TCTCTATATG AATACTTCCA 
58001 TAGCCTGGGT ATCCTGGCTT GCTTTCCTCA GTGCTGGGTG CCACCTTTGC 
58051 AATGGGAAGA AATGAATGCA AGTCACCCCA CCCCTTGTGT TTCCTTACAA 
58101 GTGCTTGAGA GGAGAAGACC AGTTTCTTCT TGCTTCTGCA TGTGGGGGAT 
58151 GTCGTAGAAG AGTGACCATT GGGAAGGACA ATGCTATCTG GTTAGTGGGG 
58201 CCTTGGGCAC AATATAAATC TGTAAACCCA AAGGTGTTTT CTCCCAGGCA 
58251 CTCTCAAAGC TTGAAGAATC CAACTTAAGG ACAGAATATG GTTCCCGAAA 
58301 AAAACTGATG ATCTGGAGTA CGCATTGCTG GCAGAACCAC AGAGCAATGG 
58351 CTGGGCATGG GCAGAGGTCA TCTGGGTGTT CCTGAGGCTG ATAACCTGTG 
58401 GCTGAAATCC CTTGCTAAAA GTCCAGGAGA CACTCCTGTT GGTATCTTTT 
58451 CTTCTGGAGT CATAGTAGTC ACCTTGCAGG GAACTTCCTC AGCCCAGGGC 
58501 TGCTGCAGGC AGCCCAGTGA CCCTTCCTCC TCTGCAGTTA TTCCCCCTTT 
58551 GGCTGCTGCA GCACCACCCC CGTCACCCAC CACCCAACCC CTGCCGCACT 
58601 CCAGCCTTTA ACAAGGGCTG TCTAGATATT CATnTAACT ACCTCCACCT 
58651 TGGAAACAAT TGCTGAAGGG GAGAGGATTT GCAATGACCA ACCACCTTGT
58701 TGGGACGCCT GCACACCTGT CTTTCCTGCT TCAACCTGAA AGATTCCTGA 
58751 TGATGATAAT CTGGACACAG AAGCCGGGCA CGGTGGCTCT AGCCTGTAAT 
58801 CTCAGCACTT TGGGAGGCCT CAGCAGGTGG ATCACCTGAG ATCAAGAGTT 
58851 TGAGAACAGC CTGACCAACA TGGTGAAACC CCGTCTGTAC TAAAAATACA 
58901 AAAATTAGCC AGGTGTGGTG GCACATACCT GTAATCCCAG CTACTCTGGA 
58951 GGCTGAGGCA GGAGAATCGC TTGAACCCAC AAGGCAGAGG TTGCAGTGAG 
59001 GCGAGATCAT GCCATTGCAC TCCAGCCTGT GCAACAAGAG CCAAACTCCA 
59051 TCTCAAAAAA AAAAA (SEQ ID NO: 3) 
DNA

Position 

941 GAGTMGTGGGTGGTCAGGTTACAGACTTMTTTTGGGTTAMAAGTAAAAACAAGAAAC 
AAGGTGTGGCTCTAAAATAATGAGATGTGCTGGGGGTGGGGCATGGCAGCTCATAAACTG 
ACCCTGAAAGCTCTTACATGTMGAGTTCCAAAAATATTTCCAAAACTTGGAAGATTCAT 
TTGGATGTTTGTGTTCATTAAAATCTCTCACTAATTCATTGTCTTGTCCACTGTCCGTAA 

CCCMCCTGGGAnGGTTTGAGTGAGTCTCTCAGACTTTCTGCCTTGGAGTTTGTGAGAG 
[A.T] 

GATGGCATACTCTGTGACCACTGTCACCCTAAAACCAAAAAGGCCCCTCTTGACAAGGAG 
TCTGAGGATTTTAGACCCAGGAAGAATGAGTGATGGGCATATATATATCCTATTACTGAG 
GCATGAGAAGAGTGGAATGGGTGGGTTGAGGTGGTGTTTTAAGGCCTCTTGCCAGCTTGT 
TTAACTCTTCTCTGGGGAACGAGGGGGACAACTGTGTACATTGGCTGCTCCAGAATGATG 
nGAGCAATCTTGAAGTGCCAGGAGCTGTGCTTTGTCTATTCATGGCCCCTGTGCCTGTG 

TGAGTTGGAACAGTTTGATACCAAAACCATCCCCCCGCCCCCCAACCCCCAGCCTAGGGT 
CCGTGGAAAAATTGGCCCCTGGTGCCAAAAAGGTTGAGGACTGCTGATCTAGAGGACCAA 
TnAnCMTGTTGGTITGAGTAMTGAGCTCnGGAnAGGTGATGGAAAAATCTGAAAA 
MCAGGGCTTTTGAGGMTAGGAAMGGCAGTMCATGTTTAACCCAGAGAGAAGTTTCT 
GGCTGTTGGCTGGGAATAGTCATAGGAAGGGCTGACACTGAAAAGAAGGAGATTGTGTTC 
[G.A] 

TTTCTTCnCTCAGAGCTATMGCAMGGCTGAAAGTrCTAGAAAAAGGCMGTTTTGTT 
TCAGTAGAAAAMGGATMTCAGMCCATTTTTAGAAMTGGAATGAGACTACTTTTGAG 
GCCATGAGTTCCTTGTCCCTGGAGAGATGAGCAGAGGTTGGACAAGTGCTTACCAGAGAT 
CnGTGGAGGCAGAMCTGTGCATCTAGCAGAGCATTGGCCTAACCCTTTCAMTGAGAT 

GCTGTTAACTCAGTCTTATTCTACATGGTAGGAATCCTGTCCCTTTGCCTCCTGCTACTT 

ACMCGTAAMTAGnGIAMTTTGnGGTGGlAMGMGAGCAGTCCACTCCAGAGGCTG^ 
ATGGGCATGCCTGGCCCCCAAGGTCTGAAGTGGTAGGGCTGTGCCTATATCCTGAGAATG 
AGATAGACTAGGCAGGCACCTTGTGCTGTAGATTCCAGCTCCTGCACATAGCTCTTGTTG 
TAAMCATCCCTGTGCTTATACCMGTAATTGAGnGACCTTTAAACACTTGCCTCTTCC 
CTGGGMCCATATAGGGGATTGGCCTGGAGACGTCTGGCCTCTGGAAGAGTTGGAAAGCA 
[G.A] 

CCATCATTAnATCCTTTCCTTTCAGCTATMCTCAGAGCTCTCMGTCTTTTCTGTGGA 
TCTTATTGCCTTGGnCTTGCCCCTTTTACTCCCAGGGAAGTTGATTCTGTCTTTTCTGT 
TCCATTTAGTATGACAGGAGCAGAGAATGTCAGAGCTGTAAGGGACCTTATAGTTAAAGC 
CTTTGGCTGGTCCTTTCATnTATAGCTGGGACTAATAAGTAACGTCAAAACCCAATGAG 
TTCACAGAnGGGTCTCGCCTTGGCATGTMCCCATATGnCATATTCTTGCTGTTTTCC 

6599 CTGTAATCCTAGCACTCTGGGAGGCCGAGGCAGAAGGATCGCTTGAGCCCATGAGCCCAG 

GAGTTTGAGACCAGCCTGGCCAACATGGCAAAACTCCACCTCTACAAAAAATACAAAAAT 
ATTAGCCAGGCGTGATGGCACACACCTGTAGTCCCAGCTACTTGGGAAGCtGAGGAGCGA 
TGATTACCTGAGCCCAGGGATATCAAGGCTGTAGTGAGCTGTGATCATGCCACTGTACTC 
CATCCAGCTGGGGGACAGAGTGAAACCCCTGTCTCAAAACAAAACAAATGAAAAAAAAAA 
[-.A.C]

CCTTAATAATCAGTAACTGTCACTTTATATTATGTTGTGAGTGTGTGTCTATATACACCT 
ATAtGTATACATTTCTCTTATTACACATTCATTGGTGATCTGATGTGGAGCCCCAGGGAT 
TAAGGGCAACTTTGAACTACCCTGACACAATCAAGCCAAATATCATTCCCGTGGAGGAAG 
TAGAGTATCTAGGTTCTGTCTCCTAGTTGCAGCTTTACCTTGAGGACAGAGACTCTAATC 
CAGCTGTGCTGMGGAGCACATCTCCTGACnCTGAGCTTTCCCCTGGTAAATTCAAACT 
6983 CACATTCATTGGTGATCTGATGTGGAGCCCCAGGGATTAAGGGCAACTTTGAACTACCCT
GACACAATCAAGCCAAATATCATTCCCGTGGAGGMGTAGAGTATCTAGGTTCTGTCTCC 
TAGTTGCAGCTTTACCTTGAGGACAGAGACTCTAATCCAGCTGTGCTGAAGGAGCACATC 
TCCTGACTTCTGAGCTTTCCCCTGGTAAATTCAAACTGGATGTCACGGCGCCCTCAGATA 
GAGCCTGGTMTTTGCCCTGGGGAGAGTGACTGTCTTTTGGATCTMTTTGACTTTTGCC 
[C.G]

CAGTTGGAGGAAAATCTTCAGGGCTAGGAAGGATTGTATTTGTCTGACCCCAGAGATAAC 
CTGGGirnTGAGGAACATGGGGCATCAACCTGAATGGTCTTGTAAGATCTCTCCCACGCC 
AGCTTGCCAGTGTTTCTCTGATGAATFTAGAGTACCTGAGTAGTGCAGGCCTGCTGGGAG 
GAGGACTCTCCCTCTGTGCTACTCAGAGAAATTCATTCTTCAAGGCCCCCTTCCAGCCTT 
GCTCTTACCCAGCTGGGCTACAGnACMTAMGGAMTGACTTTTCTTCTCCCCTTCCC 

9885 GGCGTGCCACCACACCTTGCCA 1 1 1 1 1 1 1 1 1 ATTTTAAGTAGAAACAAGGTCTTATTAAT 
ACTATGTTGCCCAGGCTGGTCTTGAACTCCAGCGATCCTCCTGCCCCAGCCTCCCAAAGT 
GCTTGGGATrACGGAAGTAAGCCACTGTGCCTGGCCAGTGCMCCCCCATTTTATACTAA 
MCAGGMGGCCCAGAAAGGTTTGGAGTAACTTGTCCAGGGTCACACAGATGATATTTGA 
ACTCAGGTCTCCCTGGCTCCCAAGAGAGTCTGCTTTCCACTAGGACTCCCAGGAGAAAAA 
[A.-] 

AAAAAAAAAMCAGTAGAC7TGGAGACAGAAMTCTGATTTGAGTCTTAGTTGAGCTAGG 
CTAACTGTGTAACTGTGGGCAAGTTCCTTAGCCCCTGTGAGCCTCAGnTCTTATCTGTA 
AAATGTCATAAAAGAAATCCATCTCATGGAGTAGTTGTGATGATCAAGGACTCTGAAAAC 
ATTAGAATGGTTTAATGTGAAGGATTAGCAGCAGCACATGGCAACATTGTGCATCTTATA 
TTMCTATCCAMTATATCMGCGTCATTTGCTATATATAAMGTCATCAAATTAGGCAC 

12538 ACTTGGGAGGCTGAGGCAGGAGAATCACTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCC 
CAGATCACGCCACTGCACTCCAGCCTGGTGACAGAGTAAGACTCCATCTCAAAAAAAAAA 
AAAAAAAAAAAMTTCCTTAATTTGGCCTACAGTAGAGCCCTCCGTAATGTGGCCTCTCT 
CCACATCTCCACAACCTCCTGCTCCCTGCACTTCAGCCTCACCTCTCTTCTGGACAGGCC 
CTCCTTCTGACAAGGGC 1 I Ittl ICATTCTGCTCCCTCTGCCTAGAATGCCCCCTTACTCT 
[G.T] 

TTCACTTMCTCCTGCTTATCGTTTAGATCTTTACCTGGATGGCTCAGAGAAATATAGAA 
GTMTTCCT(^CCCTGAAAMTAGGTTAGGTCCCTGTTTTATGTTTTCATAGACCTTTCC 
mGAGGCTTTTmAAAAMGTAGTmMTCTCACAmATTCATGTGA TCATC TCCT 
TAATGATATCTTMGACCTCTMTAGMCMT1TGGTCATGGACTGTGGGGTTTTTGCCC 
CTCATTGTGTCAGCACTGAGCATATTGTTGGCATAGGAGGGATATTTGTTGAATGAATTG 

17707 GTAGTGGGTGCTCAGAGTGTnGCTGGGTGAATGATGTATnGTTGMCGACTCTTTGGA 
CACTTGMTAAAGTCCATCCAGTATGCACCATTACCATCTCTTCGCTCTACAATATTCTT 
TTAGGCMGAGCnATCTTnGAGGTGATAAGATAAGCTCAAACTTATGTAGACTAAGAC
CTCAGTCTGTAAATGTCATCCCTAAGTCTTAAACCATCAAAACCAGGGCCTCAAGGAATG
GCATGCCTTCTOmTGTAGCMCCTGCTGT^^ 
[T.C] 

CCCAAAAGCTAGAGTCCCTTCTCCCATGGGCAGTGCTGGAAGTGTGCTAACAAATTCTTT 
CTCCATACTGCTTACGATTACAAAAAAAACCCTCAGCATCTCATGCCAGACTTGAGTTAA 
GGTTGTmcrnTGTGTGTCAGCTGTATTCTGGTCATGACTTCCTGATGATGCCCTATA
GAGATTTTGCTGAGATCAGAGGGTGCTCCACTGCCATCAGTAGCACTGACTCTTGCAGAA 
GCACCGmCTGAAGnGGCTMTGTCATCCCTCACGmGmGTnGAAATTTGTTTT 

FIG. 3-27 



U.S. Patent Jan. 22, 2002 Sheet 33 of 41 US 6,340,583 B1



18219 TGCCATCAGTAGCACTGACTCTTGCAGAAGCACCGTTTCTGAAGTTGGCTAATGTCATCC

CTCACGTTTGTnGTTTGAMTTTGTTTTAGnCCAGAGATAGCACTTTCAtGGAATGAC 
GCTATCTTCTAGAATCAC 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 GAGTTGGAGTCTCGCTGTGTCGCCAGG 
CTGGAGTGCAGTGGCACAATCTCAGCTCACTGCAATCTCCACCTTCCGGGTTCAAGTGAT 
TCCCCTGCCTCAGCCTCCCGAGGAGCTGTTACTACAGGCGCACACCCCCACTCCTGGCTA 
[-.A] 

TTTTATGTGTTTTAGTAGAGACGGGGTTTCACCGTGTTGGCCAGGATGGTCTCGATCTCC 
TGACTTTGTGATCTGCCTGCTTCAGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGTCAC 
CGCGCCTGGCCTAGAATCACCTTTTTATACCATAACGTGAGCACCACTGCCGCGTCACCA 
AGGAAAGAGAGAGGCAGCTACTGTGGGGTTACAMTGGGTAAGAGTGGCACCAGGAAGGT 
GAMGTCTCTACTTAGCCMGGCTTMCAAMTGTCMTCACCAMCATTTATTTATTAA 

19670 GACCCCCATGATGAGCAACTATAGCACTAGAACAGTGATAATAACTAATGTTTATAATGC 

ATCTTCAGTTTACAGAGGGCTTTTCTACTCATCATCTAGTTTAGTTCCTGCAACAACCTC 
TTGAGGMTATAGCACMGCAGGACMGGGMGCCCAGAGATGnAMTMTTTATCCAA 
GTTTATGCTGCTGGGMGGGCAGCACTGAMTTAAMGAAMGTTTTCTGAGCTCAAATC 
CCATGCCCTTTCCTCAATGTGAGCTCTAGCAAGGTATTCAGGAATCCTGCCTCTACAGTT 
[C.T] 

AGAGCCTCAMTTGCTGGGTATGnGAGncnGTATCTGATTTTTCTAGATTTCCTGCC 
CACATTCTTACTGTCTGGATATCAGGAAAGAGTTTATCAAATGCCTGTGGAAATCCAAGA 
TAAGGTCTCATGATGAGTAACCCAGTGAAAACATGAAGTCAAGTCTAACTAGTCACTACT 
ATTTCACTACTGCTGACTCCTGATGATCAGCTCCT7TTCTMGTGCTTACTGTCCACTTA 
nCCATCATCTGCCTAGMTTTATGTGAAGGAATCAAAGCAAAAGGATCATAAGGCTTCC 

21153 GGACCCnGTTTTAGMGGATGACTGCTGCTATMTGTAGAMGTGATnGGMGAGGGG 

AGGAGTGGGGCACGAMGATGGnAGTAGATGGGGGTGGTMTGCTTACCTTTCAGTATT 
TGGAGGCTTCGGAGTCCTCAAAAA7TCTCTTCCTTGATTGGAGTCCTCCCAGCCAATAGA 
GGGCTTCACACAMCAGTncnGGGTTnGMnGTTTGACCAGAGCTTTCTTCCGACA 
AMGGTTGGGGTGAnCATTCACTTACCACACCTTGCCTGAACATTCACTTGGGGCTGCC 
[G.T] 

GnATGAAGGCTATTGTTCTCCAGCCTGTCACAGACGCTTTGAAGACCTGTGCCTCAGCT 
GGTTCTMGGAGTCAGTTTGTTCAGCTCCGTGCCAGGTTTCCAACTTATGAAATGTGCTG 
GAGATTMCACCTCTCCTGCCATTTTATCCCTACTATAATTGCCAGTCAAAGGATTCCTG 
CAGTTGCCTCTGGCAGCCATAACTGAT6AATGTTCTGCCAGCTGCTCTGAGGACCTAGAA 
GAGCAGTTTTCTATCCAGGACCAGTTTCCAAGGGTGGGAGGGTGAAATATATCCTCCAGT 

24566 CTACTCTGGAGGCTGAGGTGAGAGGATCACTTGAGTCCAGAAGGTCGAGGTCAAGATTGT 

AGTGAGCCATGATGGCATCACCGCACTCCAGCCTGAGTGACAGAGAGAGACCCTGACTCA 
AAAAAAAAAAMCAAAAAAAAAAAACACCCTCACCACTTATCAGCTATTrGTCTTGAGAA 
TAGTGACATMCCCCTCAGiMCCTATTTCCTAATCTGTTAAATGAGGCTGATGACGTTTC 
CTCCTTnACTGGCMTTTAAACATGATGGATAATAAATGCTAAGCACTTAACACAGGGC 
[C.-]

TAGMGATATTMCTGCTCMTAAATGGTAGCTTCTTAACAGTATTCAAACCCATGTGCT 
CTTATCACATGCATTGTTGTCCCTGTGTCCAGTTGGTGGAATGGGAAAAGGCTCCCTTGT 
MCCCCATCTACCATCTTTATCAGACTTTCCTGCCATGGTTCACAGTAAGAGATAGAAGC 
TGCACGGTGACTTCTGGCTCTTTACAATGGTGAGCGGTGTGTGCCTGGTAAGGGAGAGCT 
GATGTCACTGCCCCAAATCCAGTAGTGAGATCTGAGTGTTCTGGTTTCCTCCAGCAGCCT 
GATTTGCAGCTGAGCCTGTCTATCTGGTGTGGGAAGAAGATGGGGAGTTACTTGTCAGTC

CCGGCTTACTTCACCTCCAGAGACCTGTTTCGGTGAGTTGGTCTCCGAGTTCCCCTCTCC 

ATCTCTCCTGGCCCCTGGTCCTGAGAGGAGGGTGGTCTCCCTAAATCTCCTTCTCACTTA 

GTCCTTTACCATCGGTTCTGCCGGGCAGAAGCCAGCGGAGGTTATACCCAAGGAGAATCG 

GCCTTGTGAGGTACCCCCATTATGTCCTGGAAGTGGTGAGGGGAGGGATATACCCAGAAG 

[G.A] 

AACTTCTTAGGGAGCTCCAGCTCCCCTTCTATCCCAGACAAACCTGAAGGAGCCTCCAAA 
AGATGCCACTGACCTGCCCATTGTAGATGTrACTGCTTCCGGGGGGAATAGCCCAAATAG 
AGTGCTGTTTCCAGCTCTCACATGTCTTACCTGCGGGCCATGCTGCCTGCCCAGGAATTT 
GTCCCMCMGCAGGATGGGCAGGTTTTGCCAAACTGTGGAAACTGGCAAGTCCTGGGTG 
TGGGTAGCCTGGTACACAGTAGGCACCTTATAAACGTTTGTTCTCTTAATGGCAGGCACA 

TGGGGAMGACCTGGGCGAGTGCnCTAAGACTGGAGCAATGGGCnTAGAGTGnCCTG 
AGCTGCTGGGCCAGCCCCCACACCTCCTCAGTCCCTAGGCCTAAGTACCTCCACGAGCCT 
CTCTCTGTGGGGCnCTCAGAGGGAGATGTGGAMGTCTACCTCTMCCTGGCTTTCTTT 
GCTCATTGCCCCACTCCACCTCCCATAGAAACTCCCCAGGGGGTTTCTGGCCCTCTGGGT 
CCCnCTGMTGGAGCCAnCCAGGCTAGGGTGGGGmGTTnCAnCTTTGGGAGCAG 

[C.G] 

CTGnGnCCAAAMGGCTGCCTCCCCCTCACCAGTGGTCCTGGTCGACTTTTCCCTTCT 
GGCTTCTCTAAGCTAGGTCCAGTGCCCAGATCTTGCTGCCGGGATACTAGTCAGGTGGCC 
AGGCCCTGGGCAGAAAAGCAGTGTACCATGTGGTTTTGTGGAATGACCGGACCCTGGTAG 
ATTGCTGGGAAGTGTCTGGACAGGGGGAAGGGGGAAGGGAACTGGTCCTCAATGCTGACT 
CTACCMGCGCCCTGCTAGACACTTTATCCTTTAATCTCTCAACAGCCTAAAGAGATTAT

AGATGTGGAMCTCTACCTCTMCCTGGCTTTCTTTGCTCATTGCCCCACtCCACCTCCC 

ATAGAMCTCCC(^GGGGGmCTGGCCCTCTGGGTCCCnCTGAATGGAGCCAnCCAG 

GCTAGGGTGGGGTTTGTTTTCATTCTTTGGGAGCAGCCTGTTGTTCCAAAAAGGCTGCCT 

CCCCCTCACCAGTGGTCCTGGTCGACTTnCCCTTCTGGCTTCTCTAAGCTAGGTCCAGT 
GCCCAGATCTTGCTGCCGGGATACTAGTCAGGTGGCCAGGCCCTGGGCAGAAAAGCAGTG 

[T.C] 

ACCATGTGGTTTTGTGGAATGACCGGACCCTGGTAGATTGCTGGGAAGTGTCTGGACAGG 
GGGAAGGGGGAAGGGAACTGGTCCTCAATGCTGACTCTACCAAGCGCCCTGCTAGACACT 
TTATCCTHMTCTCTCMCAGCCTAMGAGAnATATATCCCC ATTTT ACAGATGAGGC 
MCCAGTTTCMCAGAGTTMCATATGGAGCCTCACTGGGCAGCTtnTCTGTCTTCCTG 
ACTTTCTCTCATCCnCAGGGGGCTGCAGGTnGTTnCTTCTCCTAGTGGAGAGGAAAT 

MGAGCCMTGGAMnGATCnGAGmAGGAGAMGCTmACATGTGGAATTAAGAT 
GCCMGTGnGMGTAGCCACAmCAGGTCCTCATTMTTTCTCTTAATCCTGGGAAGG 
CAGCTTAGGAGMGGGTTGnCCTTTAGGAGCCAGGMCTATACCCCTTTTACCCTTGGA 
GAGGCAGGGAAGCCAGGGAGGACACAACTTCTCAGGAAGAGGAGAAGCTAGAGCAGATAG 
TGAACTCTCAACCTGAACCTTTAAGGGCCAGACCAGTAATGCCACCCAAGTCCACCTGCC 

[G.A] 

TTTGTCTTGTTCTGTCCCAGGCTTTCTGGAGAACCTGATCTTCTTGCCCCTACCCCCAAG 
CTCCGTTTGCCCAGCTAGAGTCTGGGGGGTACTGACTGACTTTCGTAGACATTCTTCCCT 
TCCCCAAATAAGAGGCCACATTCeTGAAGTCACTTCTGAAGAGATAGCTGCCACACAGGG 
CTCTTTCCCCCCAGGGAGGGACCACCCAGACCCTCTGCTCTCCCAGGTATCCGTTACCAC 
ATCACTACCTGGTCAGAMGCTGTnCTGCCAmGCCCCTCCCTCTTTTATTATAGGAT 
AAGTAGAAGCTAGACTTCTTGGGCTCCTGAACAGGGTCCTTGCTGGATTCTGTGAAACAA

ATTAAGTTCTTGACCCTAGGCCTCTGGGGGAGTACAAAGTCTATGGGAGTTCTGGGGCTG 
TGGTTGCAAGGAAAGTGACGCAACCAGATTCCATGGGGACATGATCAGGCGTGACATGTG 
AGGGAGGAAGAGGGAGCAAGGGAATGAAGAATACAACTTCTGTGTeCCATACACCCCTGC 
CTGACAGGCCATACATACTCAGCAGAGAATGCACTGTCTTTCCTACCACACTAGCGTGAG 
[G.A] 

AGTGAGCTGCMTTACCACTGTGCTTCCMGTMGAAAATACCTCAAATTGGAATTTACA 

AMGAGGTAMTTAGGGAGTGGCTTTTGTCGGACATCTTT^ 

GMTTTCACTTMTGTCCMTACTGATTTAATGAGCTTGGGTTTACACATTATCTCTTGA 

AGAAMCAMTGMCCT7TGTGTTCCAAAGCMTCCATGTTTAAAGGGAAAAAATTATGC 

ATAACTCTGCCCAGCTTCACAGTAACCTTTGGCAGGTGCCTTAGGTCCTCTGGGACTCTT 

MTCCATGTTTAMGGGAAAAMTTATGCATMCTCTGCCCAGCTTCACAGTAACCTTTG 
GCAGGTGCCTTAGGTCCTCTGGGACTCTTTTCCTTATCTGAAAAATGAAGGACTTGGATC 
AGGTGMTGGTTCCCAGCTCTGCMCTTATGTGGCTCCTCAGAGGCACACAAGCTCTTTT 
CCATTATTTGCCAAATAATGGAGGCCCTGTCTTTAACTGCAGTACAACTACACAAAATAC 
nGAMCTACAGTCTTCCTGGTTTTTGGTTGGAACTGAATCAGTGCACTCTAGCAACACT 

[-.T] 

AmcnGCTGTTCGTAGGCmAnATGTGrrTGGnMTTTmAAAACAACAATAAC 
ATAnCCATMTMTTACAGCTTMTTGGCAGACTGTTTCAGTCTATAGGATCTGCAGGA 
AGGAGGAGTMTAAAGGGAlTnTGACTGAGCTCTTATGGAACAGAGTCTCTCTAGGCCC 
CTGTCATATCTGCCCTTCTGGGCCCTGGGGAAAAGTTGGCATCCCCAGTTGTGGTGCTCT 
CCAGGTGCCCTCAGGCTGTGGTGGAGGGAGCTTCCCATTCTCTCCTTCAGCCCACTCAAT 

MCTACAGTCnCCTGGTTnTGGnGGMCTGMTCAGTGCACTCTAGCMCACnATT 
TCTTGCTGTTCGTAGGCTTCAnATGTGTTTGGnMTTTTTTAAAACAACAATAACATA 
TTCCATMTAATTACAGCTTAATTGGCAGACTGTTTCAGTCTATAGGATCTGCAGGAAGG 
AGGAGTMTAMGGGATTTTTGACTGAGCTCTTATGGAACAGAGTCTCTCTAGGCCCCTG 
TCATATCTGCCCTTCTGGGCCCTGGGGAAAAGTTGGCATCCCCAGTTGTGGTGCTCTCCA 
[G.A] 

GTGCCCTCAGGCTGTGGTGGAGGGAGCTTCCCATTCTCTCCTTCAGCCCACTCAATTCAG 
AGGCTAGGGGCTGAAAGAAGCTTCTCTACAACTGGCTGTTCACTGGGAGGTTAAGGGATG 
ACCATCCAGCCAGGCCTTCCTCAGGACATGGGAGGGCTTATGCTTTAACATGTGTAAATC 
CACTGCMTMTGIACTGGTTCTTTTACCCCATMGGTTGAGMTTTACCTGTAAACATTT 
TTGTCTGAAGMTTTGGATGTAAGTGAGGGCTGGGCCTCTATCTTATCTCACTTGGCTTC 

GGACATGGGAGGGCTTATGCTTTAACATGTGTAAATCCACTGCAATAATGACTGGTTCTT 
TTACCCCATMGGTTGAGMTTTACCTGTAMCATTTTTGTCTGMGMTTTGGATGTAA 
GTGAGGGCTGGGCCTCTATCnATCTCACTTGGCTTCTCTCAGCACAGCACCTTGCCTGC 
nGnCmCACATCCTA^TGCACAGTMCTATTTCCTMTOnA^TCTATTAGA 
ATCMTTGATTTCAGCTGGGCnGGTGGCTCCTTCCTGTAATCCCAGCACTTTGGGAGGC 
[T.C] 

MGGCTGGAGGATCACCTGAGTCCAGGAGTTTAAGACCAGCCTGGGCAACATAGGGAGAC 
CCTGTCTCTACAAAAAATAAAAAATTAGCCAGGCATGGTGGTGTGCACCTGTAGTCCCAG 
CTACTCAGGAGGCTGAGGCAGGAGGATCTCTTGAGCCTGGGAGGTCAGACTACAGTGAGC 
AATGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGTAAGACTCTGTCTCTTAAAAAA 
AAAAAAAAAAMGnGATTTCTATnGGATAGATAMTMTTCATTTTAGGACCTTTCTT
CTGACTTCAAGTGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGATTATAAGCATAAGC

CACTGTGCCCAGCTGCTCTCTATATTTTTMTACATAnATTTCCATTMfnTCACAGC 

AGTTCATTTTATAGATGAGGAAACTAGGCCAGAGAAGTAAAATATCTTGCCCAAGATGAT 

GTMCTAGTAAGTGGCAGGATCMGATTCAAACCAAGCAATGTTCAAACCTCTTGGAAGC 

AAGAATGTGGCCACTGTGGAAGGTGCAAGGCCTTGACAACAAGAATAGGGAAAAGAAGGA 

CA.G] 

CTAGAAGGAAAGAGATGGCATGGGCTCAGCAGGCCAGGGAGCTCTTAGCTGTGTGTGTTG 
GGAAGCTCAGAAGGGAGGAAGAGGTTGTCTGTGCAGGTAAGTCCTGAGMCACACCAGAC 
TTTTGAGAGGTGGAGCTTCATAGCCAGGTCATTAGGGGAGAAGGGAGCTATAGAI 1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 AGAGACGGGGTCTTACTATGTTGCCCAGGCTG 
GTCTTGAACTCCTGGGCTCAAGTGATCCTCCCACCTCAGCCTCCCAAAGTGCTGGGATTA 

AMTCCAGCAGATCCAnGAGAGTTTMGCAGCMGGTGnGTGACCAAGTTMCATTTT 
AGAAGGATCACTGGTATGGAGG7TGGATTGGAGAGGGGAAAGCCTAAAGGTATAGAGACT 
AGTTAGGAAGCTATTGTAGGCTGGGCATGGTGGTTCATGCCTGTAATCTCAGCACTTTGG 
GAGGCTGAGGTGGGAGGATTGCTTGAGGCCAGGAGTTGAAGACCAACCTGGCCAACATAG 
CMGACCCCGTCTCTGTTTTTCTTAATTAAAAGAAAAGTCCAGACGTAGACATAGTGGCT 

[T.C] 

ACGCCTGTMTGCCAGCACTTTGGGAGGCCAAGGTGGGCAGATTGCTTGAGGTCAAGAGT 
TTGGGATTAGGCCAGGCGCAGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG 
GTGGGCGGATCACAAGGTCAGGAGATCAAGACCATCCTGGCTAACACAATGAAACCCCGT 
CTCTACTAAAAGTACAAAAATTAGCCGGGCATGGTGGCGGACGCCTGTAGTCCCAGCTAC 
TCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCTAGGAGGCGGAGCTTGCTGTGAGCAGA 

GTTCTGTCCTATGTCTGTCTCTCGGATGMGCTGAGCTGGCTTTCAGAAGCCTGCAGAGT 
TAGGAAAGGAACCAGCTGGCCAGGGACAGACTATGAGGATTGTGCTGACCCAGCTGCCCC 
TGTGGGGATCACAGTTTACAGCCAGAGCCTGTGCGGACCCAGCTGTCTGCCAGGTTTCCT 
TAGAAACCTGAGAGTCAGTCTCTGTCCACTGAACTCCTAAGCTGGACAGGAGGCAGTGAT 
GCTAAACCCTGAAGGGCAACATGGCCTATGGAGAAAGCATGGAGCTCAGAGCCTGGAGTA 
[CG] 

GGGCACAGATAGGAnGAATAMnGTGTAGAMGACTTTGAAAACAATAAAGCAAAAGA 
TGMTGMCGTTTTTTTTAGACTTGAGGGACCAACAACCCCCAAACCCCAGATTCTGCCA 
GGTCCATGGGGAAGGAGAAGTTGCCTTGAGTGGAAGCCCCAAGTAGGGAGACTTACAGAA 
AAGMGTCAAGAGCACTGGCTCCCAGGCAGAAATACTGATACCCTACTGGGGCTTCAGGC 
TGAGCTCCTCCCTTCACAAATCACTTCATCTCTCTGAGCCTGTTTCTGCATCTGTGACAT 

CTCTGAGCCTGTTTCTGCATCTGTGACATAAGATGGTAAGATAAAGGTGGCTGTCTCACC 
AATTATGTAAGGATTAAATGTGGAAAAGGACATAAAGTTGTATAGTGCTGCCATAGGGAC 
AGTGnCAGTAMCGTGACACAnCTTAGTATCACTAAGAATCAGGTTCTTGGCCAGGCA 
CCGTGGGTCATGCCTGTAATCCGAACACTCTGGGAGGCCTAGGTCGGAGGATGGCTTGAA 
CACAGGAGTTTGAGACCAGCCTGAGCAACATAGTGAGACACTGTCTCTACAAAAAAAAAA 
[T,A] 

MTMTMTMnGTTTTTAATTAGATGGGCAGGGCACTGTGGCTCACACCTGTAATCCC 
AGCACTTTGGGAGGCCAAGGCCGGAGGATTGCTTGAGGCCAGGAGTTCAGGAGCAGCCTG 
GGCCACAnCCTGTCTCTACAAAGAATAAAAAAGTTAACTGGGCATGGTGGCACATGCCT 
GTAATCCCAGCTACTCAAGAGGCTGAGGAGGAGGATTGCCTGAGCCCAGGAGTTCAAGAC 
TGCAGTGAGCCTTGATCACACCACTGTACTACAGCTTGGGCAACAGAGTGAGACCTTGTC
41305 CTGAGCCTGTTTCTGCATCTGTGACATAAGATGGTAAGATAAAGGTGGCTGTCTCACCAA

TTATGTAAGGATTAAATGTGGAAAAGGACATAAAGTTGTATAGTGCTGCCATAGGGACAG 
TGTTCAGTAAACGTGACACATTCTTAGTATCACTAAGAATCAGGTrCTTGGCCAGGCACC 
GTGGCTCATGCCTGTAATCCCAACACTCTGGGAGGCCTAGGTCGGAGGATGGCTTGAACA 
CAGGAGTnGAGACCAGCCTGAGCAACATAGTGAGACACTGTCTCTACAAAAAAAAAATA 
[-.A] 

TMTMTMnGTTnTAATTAGATGGGCAGGGCACTGTGGCTCACACCTGTAATCCCAG 
CACTTTGGGAGGCCAAGGCCGGAGGATTGCTTGAGGCCAGGAGTTCAGGAGCAGCCTGGG 
CCACATTCCTGTCTCTACAAAGAATAAAAAAGTTAACTGGGCATGGTGGCACATGCCTGT 
AATCCCAGCTACTCAAGAGGCTGAGGAGGAGGATTGCCTGAGCCCAGGAGTTCAAGACTG 
CAGTGAGCCTTGATCACACCACTGTACTACAGCTTGGGCAACAGAGTGAGACCTTGTCTC 

41457 CTAAGAATCAGGTTCTTGGCCAGGCACCGTGGCTCATGCCTGTAATCCCAACACTCTGGG 

AGGCCTAGGTCGGAGGATGGCTTGAACACAGGAGTTTGAGACCAGCCTGAGCAACATAGT 
GAGACACTGTCTCTACAAAAAAAAAATMTMTMTMnGTTTTTAATTAGATGGGCAG 
GGCACTGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAAGGCCGGAGGATTGCT 
TGAGGCCAGGIAGTTCAGGAGCAGCCTGGGCCACATTCCTGTCTCTACAAAGAATAAAAAA 
[G.C] 

TTAACTGGGCATGGTGGCACATGCCTGTAATCCCAGCTACTCAAGAGGCTGAGGAGGAGG 
ATTGCCTGAGCCCAGGAGTTCAAGACTGCAGTGAGCCTTGATCACACCACTGTACTACAG 
CTTGGGCAACAGAGTGAGACCTTGTCTCCAAAAAAAAAAG 1 1 1 Gl 1 1 1 1 1 1 1 IATCCACT 
CTCCTCACCAAACAAACTGAGTAAGTTAGAGCCCTCTCAGCTGGCATGTGTTGGAAACAG 
TGCCCTCTCATTAAAGTGCTGCCCTCACTCCCATTGCCTCTTGGCCTTGGTCAGTATGAT 

43168 AGCTACTTGGGAGGCTGAGGCAGGAGMTCGCTTGAACCTGGAAGGCGGAGGTCGCAGTG 

AGCCGAGATCGTGCCATTGCACTTCAGCCTGGGCGACAGAGCGAGACTCTGTCTCAAAAA 
TAATAATAATAACAATAACTAGCCGGGCCTGGTGGCACATGCCTGTAGTCCCAGTTACTC 

AGGAGGC6GAGGCATGAGACTCAGGT6AACTAGGGAGACAGAGGTTGCAGTGAGCCAAGA 

TCACACCACTGCACTCCAGCCTGGTTGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAA 
[A.-.T] 

CCCATTTGCTCATTTTTTGGATACTAGTATAACTATCACTCTAAACCAGTTAGTACTTAA 
ATCAAGCAGATATGGGAGATGGTGAATTACCATCTACAGTGTTGTCATATATGTCACATA 
CTGAGCATTATCAGCTAGTAGAATCTAGTTAATTGTTCTATGTGTGATGTATGCAGAGTT 
CCCATTTTGAATGTGTTTnACTATGCnAAATAAATGACTGATGTCAGCAACCCCAAAA 
TGATACATCTGATGTMGAGCCCCTGTTCCCCAATAATAACATCTAAACTATAGACATTG 

43357 AGGCATGAGACTCAGGTGAACTAGGGAGACAGAGGTTGCAGTGAGCCAAGATCACACCAC 

TGCACTCCAGCCTGGnGACAGAGCGAGACTCTGTCTCAAAAAAAAAAAAATCCCATTTG 
CTCATTTTTTGGATACTAGTATMCTATCACTCTAMGCAGTTAGTACTTAAATCAAGCA 
GATATGGGAGATGGTGAATTACCATCTACAGTGTTGTCATATATGTCACATACTGAGCAT 
TATCAGCTAGTAGAATCTAGTTMTTGTTCTATGTGTGATGTATGCAGAGTTCCCATTTT 
CT.G] 

MTGTGTTTnACTATGCTTAAATAAATGACTGATGTCAGCAACCCCAAAATGATACATC 
TGATGTAAGAGCCCCTGTTCCCCAATAATAACATCTAAACTATAGACATTGGAATGAACA 
GGTGCCCCTMGTTTCCTCCCTCCAGGGTTTCTTGGCCGGTCTCTGAGGACTACACATCC 
CTACTCCCGTCTTTCCTCATCTTCAGGCGCAGTAACAGTATCTCCAAGTCCCCTGGCCCC 
AGCTCCCCAAAGGAGCCCCTGCTGTTCAGCCGTGACATCAGCCGCTCAGAATCCCTTCGT
CCAGCTTTCCTTGGCTTCCCCCACCCCCAGGTGAAAGTGATGCGCAGCCTGGACCACCCC

AATGTGCTCAAGTTCATTGGTGTGCTGTACMGGATAAGAAGCTGAACCTGCTGACAGAG 
TACATTGAGGGGGGCACACTGAAGGACTTTCTGCGCAGTATGGTGAGCACACCACCCCAT 

AGTCTCCAGGAGCCTTGGTGGGTTGTCAGACACCTATGCTATCACTACCCTAGGAGCTTA 
AAGGGCAGAGGGGCCCTGCTTTGCCTCCAAAGGACCATGCTGGGTGGGACTGAGCATACA 

[T.C] 

AGGGAGGCTTCACTGGGAGACCACATTGACCCATGGGGCCTGGACCACGAGTGGGACAGG 
GCTCAACAGCCTCTGAAMTCATTCCCCATTCTGCAGGATCCGnCCCCTGGCAGCAGAA 
GGTCAGGTTTGCCAAAGGAATCGCCTCCGGAATGGTGAGTCCCACCAACAAACCTGCCAG 
CAGGGCGAGAGTAGGGAGAGGTGTGAGAATTGTGGGCTTCACTGGAAGGTAGAGACCCCT 
TCCTATGCAACTTGTGTGGGCTGGGTCAGCAGCTATTCAnGAGTTTGTCTGTGTCACTG 

AATTAGCTGGGCGTGGTGGTGCACGCCTGTAGTCCCAGCTACTCAGGAGGCCGAGGCAGG 
AGAATAGCTTGAACCTGGGAGGCAGAAGTTGCAGTGAGCCAAGATCACACCACTGCATTC 
CAGCCTGGGTGACAGAGTGAGACTTCATCTCAAAAAAAAAAAAAAAGAGAGACTGATATG 
GTTAGTACATTGGGGTGGAATGCGGAGGGTCCAGGGAATGGAGCCCTGCATAGGGGGCTA 
ATGAMCATTTCAGATTTCTGAATTAAGGTAGTGGCTGTGGGGACAGGAGCCTGGGAGGC 
[A.C] 

GGGTGGAGTCAGAATGGAGAGACTGGTTGGCAATGAGGGAACAGGAGGAGGAGGAGGAGG 
AGTTACGAGTGGCnGAGGTGTCACTTACCAGACATTTGGGGGATGGGGGATAGCCGTGA 
TTGTTGAGCAACTGGTTTGGGAAGAGCTAGCATTGATCCCTGCTGTTCTGTGCTAGCAGA 
ACCTATCAGCATCTTCTGGGCAGGAAACTGGCTCCATGAGACTGGCTTAGGGAGAGGCTG 
CTAGTCACCTAATCTGCAGAGAAGGGGCAGCTGGAGCTGTGGGACAGAAGAGGCATCCAT 

GGAGnACGAGTGGCnGAGGTGTCACTTACCAGACATTTGGGGGATGGGGGATAGCCGT 
GATTGTTGAGCAACTGGTTTGGGAAGAGCTAGCATTGATCCCTGCTGTTCTGTGCTAGCA 
GAACCTATCAGCATCTTCTGGGCAGGAAACTGGCTCCATGAGACTGGCTTAGGGAGAGGC 
TGCTAGTCACCTAATCTGCAGAGAAGGGGCAGCTGGAGCTGTGGGACAGAAGAGGCATCC 
ATGTAGCTGGTGGGGGTGTCTCAGCTTGTGAAGAGGAGATGGCTTTGAGCAGGGCTGACA 

[C.A] 

TGAAAAGGCTGGAAGAAAAAAACAGACACACAAGAGTCTCAGGATCAGGTAGCATAGGAA 
AGTTGTGGACAGTCTTTGAGGAGCACTCCCTCAGGCAGGCAGGCAGGCAGGTCATGAGCT 
ATAGCGATTCAGGAAGAGCTCCCTGGGTGTGTGAGCAGCTCCAGGAGCCTAAGGGATGAA 
AGTAGTATTGCAGGGGGCTGGAGAGCAAGGAGTGGCTCCTTCTACATTTGCAAGGGAAGG 
AGAAAGGAAGTTGCTCCTGAGAGTGGTAAGAGTCAGTGGTGGAGGCCTGGAGAGGAGACA 

TTGTGAGGGGTAGAGGAGAGGAGAGACMGGGATGGnAGGATMTGMGGMTGTTTTG 

TTTrrGTTTTTGTTTTTGAGATGGAGTTTCACTCTGTCACCCAGGCTGGA 

GCAATCTTGGCTCACTGCAGCCTCCGCCTCCCAGGTTCAAGCAATC CTCCTG CCTCAGCC 

TCCCMGTAGCTGGGACTAC^GGTGTGCGCGACCACGCCTGGCTMTTTTTGTATTfTCA 

GTAGAGACAGGGTTTCGCCATATTGGCCAGGCTGGTCTCAAATGCCTGACCTCAGGTGAT 

[C.A] 

CACCCGCTTCAGCCTCCCAAAGTGCTGAGATTACAGGCATGAGCTACCGTGCCTGGCCAT 
GMGGMGATnGTTTTAAAAMTTGlTITCTTTMTAnMTTGAACACCTCTGTTCAG 
AGCACTGGGCTGGTGCCAGAGGGTTTCAGACATGAATCAGATCCAGCACCTCATAGAGCC 
TTAATCTGGCACACACACACAGCCACAAGGAGACACAGACAAGGCAGGGTAGGATGAGTG 
GMGCTAGGAGCAGATGCTGATTTGGAACACTTGGCnCTGCAGTGAAGCCCCTTCTTAG
54654 GGCCCCGGCCCCGGCCCCCAGGCCAGGCAGTGGCGGCCAAGGACCACGCATCTACTTTCA
GAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGGCCCAGTGA 
GGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCACCTCAACC 
TAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGAGGAGATCT 
CAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGCCTGGGCTT 

[T.C] 

CAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCATCTCTGGCCT 
GCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCC 
GACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGTAGCAACAG 
CAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGAGCAGG GCTCCTCGT GCCCCTG 
GCCCAGGGGTCTCTTCCCCTGCCCCCTCAGTTTTCCACTTTTGGA 1 1 1 1 1 1 1 ATTGTTAT 

54679 GGCAGTGGCGGCCAAGGACCACGCATCTACnTCAGAGCCCCCCCCGGGGCCGCAGGAGA 
GGGCCCGGGCTGGGCGGATGATGAGGGCCCAGTGAGGCGCCAAGGGAAGGTCACCATCAA 
GTATGACCCCAAGGAGCTACGGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCT 
CACGCGCCTCTACGACTGCCAGGAAGAGGAGATCTCAGAACTAGAGATTGACGTGGATGA 
GCTCCTGGACATGGAGAGTGACGATGCCTGGGC7TCCAGGGTCAAGGAGCTGCTGGTTGA 

[C.G] 

TGTTACAAACCCACAGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCAG 
AAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCAT 
AGGACAATCGCTACCCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAG 
GCCTGGTTCCATGAGCAGGGCTCCTCGTGCCCCTGGCCCAGGGGTCTCTTCCCC TGCCCC 

CTCAGTTTTCCACTTTTGGATTTTT^ 

54693 AGGACCACGCATCTACTTTCAGAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGG 
CGGATGATGAGGGCCCAGTGAGGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGG 
AGCTACGGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACG 
ACTGCCAGGAAGAGGAGATCTCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGG 
AGAGTGACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCA 

[A.C] 

AGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACC 
CCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTAC 
CCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGA 
GCAGGGCTCCTCGTGCCCCTGGCCCAGGGGTCTCnCCCCTGCCTCCTCAGTTTTCCACT 
TTTGGATTTTTTTATTGnATTAMCTGATGGGACTTTGTGTTIT^ 

54706 TACTTTCAGAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGG 
CCCAGTGAGGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCA 
CCTCAACCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGA 
GGAGATCfCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGC 
CTGGGCnCCAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCAT 

[T.C] 

TCTGGCCTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGA 
GGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGT 
AGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCATGAGCAGGG CTCCTCG 
TGCCCCTGGCCCAGGGGTCTCTTCGCCTGCCCCCTCAGTTTTCCACTTTTGGAl 1 1 1 1 1 1 
AnGTTAnAMCTGATGGGACmGTGTTmATATTGACTCTGCGGCACGGGCCCm
CAGAGCCCCCCCCGGGGCCGCAGGAGAGGGCCCGGGCTGGGCGGATGATGAGGGCCCAGT
GAGGCGCCAAGGGAAGGTCACCATCAAGTATGACCCCAAGGAGCTACGGAAGCACCTCAA 
CCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCCAGGAAGAGGAGAT 
CTCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTGACGATGCCTGGGC 
TTCCAGGGTCAAGGAGCTGCTGGTTGACTGTTACAAACCCACAGAGGCCTTCATCTCTGG 
[T.C] 

CTGCTGGACAAGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCC 
CCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCCGACCTCGTAGCAAC 
AGCAATACCGGGGGACCCTGCGGCCAGGCCTGGTTCCAT GAGCAG G GCTCCTCGT GCCCC 
TGGCCCAGGGGTCTCTTCCCCTGCCCCCTCAGnnCCACTTTTGGATTTTTTTATTGTT 
ATTAAACTGATGGGACTTTGTGTTTnATATTGACTCTGCGGCACGGGCCCTTTAATAAA 

GTATGACCCCAAGGAGCTACGGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCT 
CACGCGCCTCTACGACTGCCAGGAAGAGGAGATCTCAGAACTAGAGATTGACGTGGATGA 
GCTCCTGGACATGGAGAGTGACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCTGGTTGA 
CTGTTACAAACCCACAGAGGCCTTCATCTCTGGCCTGCTGGACAAGATCCGGGCCATGCA 
GAAGCTGAGCACACCCCAGAAGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCA 

[T.C] 

AGGACAATCGCTACCCCCCGACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAG 
GCCTGGTTCCATGAGCAGGGCTCCTCGTGCCCCTGGCCCA6GGGTCTCTTCCCCT GCCCC 
CTCAGTTTTCCACTTTTGGA 1 1 1 1 1 1 1 ATTGTTATTAMCTGATGGGACTTTGTGTTTTT 
ATATTGACTCTGCGGCACGGGCCCTTTAATAAAGCGAGGTAGGGTACGCCTTTGGTGCAG 
CTCAAAAAAAAAAAAAAAMTGATTTC(^GCGGTCCACATTAGAGTTGAMTTTTCTGGT 

GGAAGCACCTCAACCTAGAGGAGTGGATCCTGGAGCAGCTCACGCGCCTCTACGACTGCC 
AGGMGAGGAGATCTCAGAACTAGAGATTGACGTGGATGAGCTCCTGGACATGGAGAGTG 

ACGATGCCTGGGCTTCCAGGGTCAAGGAGCTGCT6GTTGACTGTTACAAACCCACAGAGG 

CCnCATCTCTGGICCTGCTGGACMGATCCGGGCCATGCAGAAGCTGAGCACACCCCAGA 

AGAAGTGAGGGTCCCCGACCCAGGCGAACGGTGGCTCCCATAGGACAATCGCTACCCCCC 



ACCTCGTAGCAACAGCAATACCGGGGGACCCTGCGGCCAGGCCTG GTTCC ATGAGCAGGG 
CTCCTCGTGCCCCTGGCCCAGGGGTCTCTTCCCC TGCCCC CTCAGTTTTCCACTTTTGGA 
1 1 1 1 1 1 lATTGnATTAMCTGATGGGACTTTGTGTTTTTATATTGACTCTGCGGCACGG 
GCCCTTTMTAMGCGAGGTAGGGTACGCCTTTGGTGCAGCTCAAAAAAAAAAAAAAAAA 
TG^TnCCAGCGGTCCACAnAGAGTTGAAATmCTGGTGGGAGAATCTATACCnGTT 

TTGTmCTMTACCTCTTGTCATTCTAMTATCmMmATTAAAAAATATATATAT 

ACAGTAnGMTGCCTACTGTGTGCTAGGTACAGTTCTAAACACTTGGGTTACAGCAGCG 
AACAAAATAAAGGTGCTTACCCTCATAGAACATAGATTCTAGCATGGTATCTAGTGTATC 
ATACAGTAGATA(^TMGTAMCTATATTGAATATTAGAATGTGGCAGATGCTATGGAA 
AMGAGTCMGACMGTAMGACGAnGnCAGGGTACCAGTTGCMTTnAAATATGGT 
[C.T] 

GTCAGAGCAGGCCTCACTGAGGTGACATGACATTTAAGCATAAACATGGAGGAGGAGGAG 
TMGCCTGAGCTGTCTTAGGCTTCCGGGGCAGCCAAGCCATTTCCGTGGCACTAGGAGCC 
TGGTGmCCGATTCCACCmGATMCTGCATmCTCTMGATATGGGAGGGAAGm 
mTCCTAnGTTmMGTATTAACTCCAGCTAGTCCAGCCnGTTATAGTGTTACCTA 
ATCTTTATAGCAAATATATGAGGTACCGGTMCATTATGCCCATTTCTCACAGAGGCACT
56825 ACTGATGGCTCAAAGGGTGTGAAAAAGTCAGTGATGCTCCCCCTTTCTACTCCAGATCCT

GTCCnCCTGGAGCMGGnGAGGGAGTAGGTTTTGAAGAGTCCCTTAATATGTGGTGGA 

ACAGGCCAGGAGTTAGAGAAAGGGCTGGCTTCTGTTTACCTGCTCACTGGCTCTAGCCAG 

CCCAGGGACCACATCMTGTGAGAGGMGCCTCCACCTCATGTTTTCAAACTTAATACTG 

GAGACTGGCTGAGAACTTACGGACAACATCCTTTCTGTCTGAAACAAACAGTCACAAGCA 
[CA] 

AGGAAGAGGCTGGGGGACTAGAAAGAGGCCCTGCCCTCTAGAAAGCTCAGATCTTGGCTT 
CTGnACTCATACTCGGGTGGGCTCCnAGTCAGATGCCTAAAACATnTGCCTAAAGCT 
CGATGGGTTCTGGAGGACAGTGTGGCTTGTCACAGGCCTAGAGTCTGAGGGAGGGGAGTG 
GGAGTCTCAGCAATCTCTTGGTCTTGGCTTCATGGCAACCACTGCTCACCCTTCAACATG 
CCTGGTTTAGGCAGCAGCTTGGGCTGGGAAGAGGTGGTGGCAGAGTCTCAAAGCTGAGAT 

58871 CGTCACCCACCACCCAACCCCTGCCGCACTCCAGCCTTTAACAAGGGCTGTCTAGATATT 

CATTTTMCTACCTCCACCTTGGAMCMTTGCTGAAGGGGAGAGGATTTGCAATGACCA 
ACCACCTTGTTGGGACGCCTGCACACCTGTCTTTCCTGCTTCAACCTGAAAGATTCCTGA 
TGATGATAATCTGGACACAGAAGCCGGGCACGGTGGCTCTAGCCTGTAATCTCAGCACTT 
TGGGAGGCCTCAGCAGGTGGATCACCTGAGATCAAGAGTTrGAGAACAGCCTGACCAACA 
[T.A] 

GGTGAMCCCCGTCTCTACTAAAAATACAAAAA'n'AGCCAGGTGTGGTGGCACATACCTG 
TAATCCCAGCTACTCTGGAGGCTGAGGCAGGAGAATCGCTTGAACCCACAAGGCAGAGGT 
TGCAGTGAGGCGAGATCATGCCATTGCACTCCAGCCTGTGCAACAAGAGCCAAACTCCAT 
CTCAAAAAAAAAAA

ISOLATED HUMAN KINASE PROTEINS,
NUCLEIC ACID MOLECULES ENCODING
HUMAN KINASE PROTEINS, AND USES
THEREOF

FIELD OF THE INVENTION 

The present invention is in the field of kinase proteins that 
are related to the serine/threonine kinase subfamily, recom- 
binant DNA molecules, and protein production. The present 
invention specifically provides novel peptides and proteins 
that effect protein phosphorylation and nucleic acid mol- 
ecules encoding such peptide and protein molecules, all of 
which are useful in the development of human therapeutics 
and diagnostic compositions and methods. 

BACKGROUND OF THE INVENTION 

Protein Kinases 

Kinases regulate many different cell proliferation,
differentiation, and signaling processes by adding phosphate
groups to proteins. Uncontrolled signaling has been implicated
in a variety of disease conditions including inflammation,
cancer, arteriosclerosis, and psoriasis. Reversible protein
phosphorylation is the main strategy for controlling activities
of eukaryotic cells. It is estimated that more than 1000 of the
10,000 proteins active in a typical mammalian cell are
phosphorylated. The high energy phosphate, which drives
activation, is generally transferred from adenosine triphosphate
molecules (ATP) to a particular protein by protein kinases and
removed from that protein by protein phosphatases.
Phosphorylation occurs in response to extracellular signals
(hormones, neurotransmitters, growth and differentiation
factors, etc.), cell cycle checkpoints, and environmental or
nutritional stresses and is roughly analogous to turning on a
molecular switch. When the switch goes on, the appropriate
protein kinase activates a metabolic enzyme, regulatory protein,
receptor, cytoskeletal protein, ion channel or pump, or
transcription factor.

The kinases comprise the largest known protein group, a
superfamily of enzymes with widely varied functions and
specificities. They are usually named after their substrate,
their regulatory molecules, or some aspect of a mutant
phenotype. With regard to substrates, the protein kinases
may be roughly divided into two groups: those that
phosphorylate tyrosine residues (protein tyrosine kinases, PTK)
and those that phosphorylate serine or threonine residues
(serine/threonine kinases, STK). A few protein kinases have
dual specificity and phosphorylate threonine and tyrosine
residues. Almost all kinases contain a similar 250-300
amino acid catalytic domain. The N-terminal domain, which
contains subdomains I-IV, generally folds into a two-lobed
structure, which binds and orients the ATP (or GTP) donor
molecule. The larger C-terminal lobe, which contains
subdomains VI A-XI, binds the protein substrate and carries out
the transfer of the gamma phosphate from ATP to the
hydroxyl group of a serine, threonine, or tyrosine residue.
Subdomain V spans the two lobes.

The kinases may be categorized into families by the
different amino acid sequences (generally between 5 and
100 residues) located on either side of, or inserted into loops
of, the kinase domain. These added amino acid sequences
allow the regulation of each kinase as it recognizes and
interacts with its target protein. The primary structure of the
kinase domains is conserved and can be further subdivided
into 11 subdomains. Each of the 11 subdomains contains
specific residues and motifs or patterns of amino acids that
are characteristic of that subdomain and are highly conserved
(Hardie, G. and Hanks, S. (1995) The Protein Kinase
Facts Books, Vol 1:7-20 Academic Press, San Diego, Calif.).

The second messenger dependent protein kinases primarily
mediate the effects of second messengers such as cyclic
AMP (cAMP), cyclic GMP, inositol triphosphate,
phosphatidylinositol 3,4,5-triphosphate, cyclic-ADPribose,
arachidonic acid, diacylglycerol and calcium-calmodulin.
The cyclic-AMP dependent protein kinases (PKA) are
important members of the STK family. Cyclic-AMP is an
intracellular mediator of hormone action in all prokaryotic
and animal cells that have been studied. Such hormone-induced
cellular responses include thyroid hormone secretion,
cortisol secretion, progesterone secretion, glycogen
breakdown, bone resorption, and regulation of heart rate
and force of heart muscle contraction. PKA is found in all
animal cells and is thought to account for the effects of
cyclic-AMP in most of these cells. Altered PKA expression
is implicated in a variety of disorders and diseases including
cancer, thyroid disorders, diabetes, atherosclerosis, and
cardiovascular disease (Isselbacher, K. J. et al. (1994)
Harrison's Principles of Internal Medicine, McGraw-Hill, New
York, N.Y., pp. 416-431, 1887).

Calcium-calmodulin (CaM) dependent protein kinases are
also members of the STK family. Calmodulin is a calcium
receptor that mediates many calcium regulated processes by
binding to target proteins in response to the binding of
calcium. The principal target proteins in these processes are
CaM dependent protein kinases. CaM-kinases are involved
in regulation of smooth muscle contraction (MLC kinase),
glycogen breakdown (phosphorylase kinase), and
neurotransmission (CaM kinase I and CaM kinase II). CaM
kinase I phosphorylates a variety of substrates including the
neurotransmitter related proteins synapsin I and II, the gene
transcription regulator, CREB, and the cystic fibrosis
conductance regulator protein, CFTR (Haribabu, B. et al. (1995)
EMBO Journal 14:3679-86). CaM II kinase also phosphorylates
synapsin at different sites, and controls the synthesis
of catecholamines in the brain through phosphorylation and
activation of tyrosine hydroxylase. Many of the CaM
kinases are activated by phosphorylation in addition to
binding to CaM. The kinase may autophosphorylate itself, or
be phosphorylated by another kinase as part of a "kinase
cascade".

Another ligand-activated protein kinase is 5'-AMP-activated
protein kinase (AMPK) (Gao, G. et al. (1996) J.
Biol. Chem. 15:8675-81). Mammalian AMPK is a regulator
of fatty acid and sterol synthesis through phosphorylation of
the enzymes acetyl-CoA carboxylase and
hydroxymethylglutaryl-CoA reductase and mediates
responses of these pathways to cellular stresses such as heat
shock and depletion of glucose and ATP. AMPK is a
heterotrimeric complex comprised of a catalytic alpha subunit
and two non-catalytic beta and gamma subunits that are
believed to regulate the activity of the alpha subunit.
Subunits of AMPK have a much wider distribution in
non-lipogenic tissues such as brain, heart, spleen, and lung than
expected. This distribution suggests that its role may extend
beyond regulation of lipid metabolism alone.

The mitogen-activated protein kinases (MAP) are also
members of the STK family. MAP kinases also regulate
intracellular signaling pathways. They mediate signal
transduction from the cell surface to the nucleus via
phosphorylation cascades. Several subgroups have been
identified, and each manifests different substrate specificities
and responds to distinct extracellular stimuli (Egan, S. E. and
Weinberg, R. A. (1993) Nature 365:781-783). MAP kinase signaling
pathways are present in mammalian cells as well as in yeast.
The extracellular stimuli that activate mammalian pathways
include epidermal growth factor (EGF), ultraviolet light,
hyperosmolar medium, heat shock, endotoxic
lipopolysaccharide (LPS), and pro-inflammatory cytokines such as
tumor necrosis factor (TNF) and interleukin-1 (IL-1).

PRK (proliferation-related kinase) is a serum/cytokine
inducible STK that is involved in regulation of the cell cycle
and cell proliferation in human megakaryocytic cells (Li, B. et
al. (1996) J. Biol. Chem. 271:19402-8). PRK is related to
the polo (derived from humans polo gene) family of STKs
implicated in cell division. PRK is downregulated in lung
tumor tissue and may be a proto-oncogene whose deregulated
expression in normal tissue leads to oncogenic
transformation. Altered MAP kinase expression is implicated in
a variety of disease conditions including cancer,
inflammation, immune disorders, and disorders affecting
growth and development.

The cyclin-dependent protein kinases (CDKs) are another
group of STKs that control the progression of cells through
the cell cycle. Cyclins are small regulatory proteins that act
by binding to and activating CDKs that then trigger various
phases of the cell cycle by phosphorylating and activating
selected proteins involved in the mitotic process. CDKs are
unique in that they require multiple inputs to become
activated. In addition to the binding of cyclin, CDK
activation requires the phosphorylation of a specific threonine
residue and the dephosphorylation of a specific tyrosine
residue.
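
The three activation requirements described above can be sketched as a simple conjunction. This is only an illustrative model, not part of the specification: the class name `CdkState`, its field names, and the function `is_active` are hypothetical, and the model ignores every regulatory input other than the three named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdkState:
    """Hypothetical record of the three inputs named in the text."""
    cyclin_bound: bool                    # cyclin is bound to the CDK
    activating_thr_phosphorylated: bool   # the specific threonine carries a phosphate
    inhibitory_tyr_phosphorylated: bool   # the specific tyrosine carries a phosphate


def is_active(state: CdkState) -> bool:
    """Return True only when all three conditions from the text hold:
    cyclin bound, threonine phosphorylated, tyrosine dephosphorylated."""
    return (
        state.cyclin_bound
        and state.activating_thr_phosphorylated
        and not state.inhibitory_tyr_phosphorylated
    )


if __name__ == "__main__":
    print(is_active(CdkState(True, True, False)))
    print(is_active(CdkState(True, True, True)))
```

Any single missing input leaves the kinase inactive in this model, which is the sense in which the text calls CDKs dependent on multiple inputs.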

Protein tyrosine kinases, PTKs, specifically phosphorylate
tyrosine residues on their target proteins and may be
divided into transmembrane, receptor PTKs and
nontransmembrane, non-receptor PTKs. Transmembrane
protein-tyrosine kinases are receptors for most growth
factors. Binding of growth factor to the receptor activates the
transfer of a phosphate group from ATP to selected tyrosine
side chains of the receptor and other specific proteins.
Growth factors (GF) associated with receptor PTKs include
epidermal GF, platelet-derived GF, fibroblast GF, hepatocyte
GF, insulin and insulin-like GFs, nerve GF, vascular
endothelial GF, and macrophage colony stimulating factor.

Non-receptor PTKs lack transmembrane regions and,
instead, form complexes with the intracellular regions of cell
surface receptors. Such receptors that function through
non-receptor PTKs include those for cytokines, hormones
(growth hormone and prolactin) and antigen-specific
receptors on T and B lymphocytes.

Many of these PTKs were first identified as the products
of mutant oncogenes in cancer cells where their activation
was no longer subject to normal cellular controls. In fact,
about one third of the known oncogenes encode PTKs, and
it is well known that cellular transformation (oncogenesis) is
often accompanied by increased tyrosine phosphorylation
activity (Carbonneau H and Tonks NK (1992) Annu. Rev.
Cell Biol. 8:463-93). Regulation of PTK activity may
therefore be an important strategy in controlling some types
of cancer.

LIM Domain Kinases 

The novel human protein, and encoding gene, provided by
the present invention is related to the family of
serine/threonine kinases in general, particularly LIM domain
kinases (LIMK), and shows the highest degree of similarity
to LIMK2, and the LIMK2b isoform (Genbank gi8051618)
in particular (see the amino acid sequence alignment of the
protein of the present invention against LIMK2b provided in
FIG. 2). LIMK proteins generally have serine/threonine
kinase activity. The protein of the present invention may be
a novel alternative splice form of the art-known protein
provided in Genbank gi8051618; however, the structure of
the gene provided by the present invention is different from
the art-known gene of gi8051618 and the first exon of the
gene of the present invention is novel, suggesting a novel
gene rather than an alternative splice form. Furthermore, the
protein of the present invention lacks an LIM domain
relative to gi8051618. The protein of the present invention
does contain the kinase catalytic domain.

Approximately 40 LIM proteins, named for the LIM
domains they contain, are known to exist in eukaryotes. LIM
domains are conserved, cysteine-rich structures that contain 2
zinc fingers that are thought to modulate protein-protein
interactions. LIMK1 and LIMK2 are members of a LIM
subfamily characterized by 2 N-terminal LIM domains and
a C-terminal protein kinase domain. LIMK1 and LIMK2
mRNA expression varies greatly between different tissues.
The protein kinase domains of LIMK1 and LIMK2 contain
a unique sequence motif comprising Asp-Leu-Asn-Ser-His-
Asn in subdomain VI B and a strongly basic insert between
subdomains VII and VIII (Okano et al., J. Biol. Chem. 270
(52), 31321-31330 (1995)). The protein kinase domain
present in LIMKs is significantly different from other kinase
domains, sharing about 32% identity.
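
As an illustration of how the subdomain VI B motif named above might be located in a protein sequence, the following sketch converts the three-letter motif to one-letter code and scans for it. The function names and the example sequence are hypothetical and are not taken from SEQ ID NO:2; only the motif Asp-Leu-Asn-Ser-His-Asn comes from the text.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def three_to_one(motif: str) -> str:
    """Convert a hyphen-separated three-letter motif to one-letter code."""
    return "".join(THREE_TO_ONE[residue] for residue in motif.split("-"))


def find_motif(protein: str, motif: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of motif,
    including overlapping occurrences."""
    positions = []
    start = protein.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = protein.find(motif, start + 1)
    return positions


if __name__ == "__main__":
    limk_motif = three_to_one("Asp-Leu-Asn-Ser-His-Asn")
    made_up_sequence = "MKKDLNSHNCLV"  # invented for illustration only
    print(limk_motif, find_motif(made_up_sequence, limk_motif))
```

An exact string match is the simplest choice here; a profile or alignment method would be needed to find diverged copies of the motif.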

LIMK is activated by ROCK (a downstream effector of 
Rho) via phosphorylation. LIMK then phosphorylates 
cofilin, which inhibits its actin-depolymerizing activity, 
thereby leading to Rho-induced reorganization of the actin 
cytoskeleton (Maekawa et al., Science 285: 895-898, 1999). 

The LIMK2a and LIMK2b alternative transcript forms are
differentially expressed in a tissue-specific manner and are
generated by variation in transcriptional initiation utilizing
alternative promoters. LIMK2a contains 2 LIM domains, a
PDZ domain (a domain that functions in protein-protein
interactions targeting the protein to the submembranous
compartment), and a kinase domain; whereas LIMK2b just
has 1.5 LIM domains. Alteration of LIMK2a and LIMK2b
regulation has been observed in some cancer cell lines
(Osada et al., Biochem. Biophys. Res. Commun. 229:
582-589, 1996).

For a further review of LIMK proteins, see Nomoto et al.,
Gene 236 (2), 259-271 (1999).

Kinase proteins, particularly members of the serine/ 
threonine kinase subfamily, are a major target for drug 
action and development. Accordingly, it is valuable to the 
field of pharmaceutical development to identify and char- 
acterize previously unknown members of this subfamily of 
kinase proteins. The present invention advances the state of 
the art by providing previously unidentified human kinase 
proteins that have homology to members of the serine/ 
threonine kinase subfamily. 

SUMMARY OF THE INVENTION 

The present invention is based in part on the identification 
of amino acid sequences of human kinase peptides and 
proteins that are related to the serine/threonine kinase 
subfamily, as well as allelic variants and other mammalian 
orthologs thereof. These unique peptide sequences, and 
nucleic acid sequences that encode these peptides, can be 
used as models for the development of human therapeutic 
targets, aid in the identification of therapeutic proteins, and 
serve as targets for the development of human therapeutic 
agents that modulate kinase activity in cells and tissues that 
express the kinase. Experimental data as provided in FIG. 1
indicates expression in humans in teratocarcinoma, ovary,
testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland.

DESCRIPTION OF THE FIGURE SHEETS 

FIG. 1 provides the nucleotide sequence of a cDNA
molecule that encodes the kinase protein of the present
invention. (SEQ ID NO:1) In addition, structure and
functional information is provided, such as ATG start, stop and
tissue distribution, where available, that allows one to
readily determine specific uses of inventions based on this
molecular sequence. Experimental data as provided in FIG.
1 indicates expression in humans in teratocarcinoma, ovary,
testis, nervous tissue, bladder, infant and fetal brain, and
thyroid gland.

FIG. 2 provides the predicted amino acid sequence of the 
kinase of the present invention. (SEQ ID NO:2) In addition 
structure and functional information such as protein family, 
function, and modification sites is provided where available, 
allowing one to readily determine specific uses of inventions 
based on this molecular sequence. 

FIG. 3 provides genomic sequences that span the gene 
encoding the kinase protein of the present invention. (SEQ 
ID NO:3) In addition structure and functional information, 
such as intron/exon structure, promoter location, etc., is 
provided where available, allowing one to readily determine 
specific uses of inventions based on this molecular 
sequence. As illustrated in FIG. 3, SNPs were identified at 
42 different nucleotide positions. 

DETAILED DESCRIPTION OF THE 
INVENTION 

General Description 

The present invention is based on the sequencing of the 
human genome. During the sequencing and assembly of the 
human genome, analysis of the sequence information 
revealed previously unidentified fragments of the human 
genome that encode peptides that share structural and/or 
sequence homology to protein/peptide/domains identified 
and characterized within the art as being a kinase protein or 
part of a kinase protein and are related to the serine/ 
threonine kinase subfamily. Utilizing these sequences, addi- 
tional genomic sequences were assembled and transcript 
and/or cDNA sequences were isolated and characterized. 
Based on this analysis, the present invention provides amino 
acid sequences of human kinase peptides and proteins that 
are related to the serine/threonine kinase subfamily, nucleic 
acid sequences in the form of transcript sequences, cDNA 
sequences and/or genomic sequences that encode these 
kinase peptides and proteins, nucleic acid variation (allelic 
information), tissue distribution of expression, and informa- 
tion about the closest art known protein/peptide/domain that 
has structural or sequence homology to the kinase of the 
present invention. 

In addition to being previously unknown, the peptides that 
are provided in the present invention are selected based on 
their ability to be used for the development of commercially 
important products and services. Specifically, the present 
peptides are selected based on homology and/or structural 
relatedness to known kinase proteins of the serine/threonine 
kinase subfamily and the expression pattern observed. 
Experimental data as provided in FIG. 1 indicates expres- 
sion in humans in teratocarcinoma, ovary, testis, nervous 
tissue, bladder, infant and fetal brain, and thyroid gland. The 
art has clearly established the commercial importance of 
members of this family of proteins and proteins that have 
expression patterns similar to that of the present gene. Some 
of the more specific features of the peptides of the present 
invention, and the uses thereof, are described herein,
particularly in the Background of the Invention and in the 
annotation provided in the Figures, and/or are known within 
the art for each of the known serine/threonine kinase family 
or subfamily of kinase proteins. 

Specific Embodiments 

Peptide Molecules 

The present invention provides nucleic acid sequences 
that encode protein molecules that have been identified as 
being members of the kinase family of proteins and are 
related to the serine/threonine kinase subfamily (protein 
sequences are provided in FIG. 2, transcript/cDNA 
sequences are provided in FIG. 1 and genomic sequences are 
provided in FIG. 3). The peptide sequences provided in FIG. 
2, as well as the obvious variants described herein,
particularly allelic variants as identified herein and using the 
information in FIG. 3, will be referred to herein as the kinase 
peptides of the present invention, kinase peptides, or 
peptides/proteins of the present invention. 

The present invention provides isolated peptide and protein
molecules that consist of, consist essentially of, or 
comprise the amino acid sequences of the kinase peptides 
disclosed in FIG. 2 (encoded by the nucleic acid 
molecule shown in FIG. 1, transcript/cDNA, or FIG. 3, 
genomic sequence), as well as all obvious variants of these 
peptides that are within the art to make and use. Some of 
these variants are described in detail below. 

As used herein, a peptide is said to be "isolated" or 
"purified" when it is substantially free of cellular material or 
free of chemical precursors or other chemicals. The peptides 
of the present invention can be purified to homogeneity or 
other degrees of purity. The level of purification will be 
based on the intended use. The critical feature is that the 
preparation allows for the desired function of the peptide, 
even if in the presence of considerable amounts of other 
components (the features of an isolated nucleic acid molecule
are discussed below). 

In some uses, "substantially free of cellular material" 
includes preparations of the peptide having less than about 
30% (by dry weight) other proteins (i.e., contaminating 
protein), less than about 20% other proteins, less than about 
10% other proteins, or less than about 5% other proteins. 
When the peptide is recombinantly produced, it can also be 
substantially free of culture medium, i.e., culture medium 
represents less than about 20% of the volume of the protein 
preparation. 

The language "substantially free of chemical precursors 
or other chemicals" includes preparations of the peptide in 
which it is separated from chemical precursors or other 
chemicals that are involved in its synthesis. In one 
embodiment, the language "substantially free of chemical 
precursors or other chemicals" includes preparations of the 
kinase peptide having less than about 30% (by dry weight) 
chemical precursors or other chemicals, less than about 20% 
chemical precursors or other chemicals, less than about 10% 
chemical precursors or other chemicals, or less than about 
5% chemical precursors or other chemicals. 

The isolated kinase peptide can be purified from cells that 
naturally express it, purified from cells that have been 
altered to express it (recombinant), or synthesized using 
known protein synthesis methods. Experimental data as 
provided in FIG. 1 indicates expression in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
and fetal brain, and thyroid gland. For example, a nucleic 
acid molecule encoding the kinase peptide is cloned into an 
expression vector, the expression vector introduced into a 
host cell and the protein expressed in the host cell. The 
protein can then be isolated from the cells by an appropriate 
purification scheme using standard protein purification
techniques. Many of these techniques are described in detail 
below. 

Accordingly, the present invention provides proteins that 
consist of the amino acid sequences provided in FIG. 2 (SEQ 
ID NO:2), for example, proteins encoded by the transcript/ 
cDNA nucleic acid sequences shown in FIG. 1 (SEQ ID 
NO:1) and the genomic sequences provided in FIG. 3 (SEQ 
ID NO:3). The amino acid sequence of such a protein is 
provided in FIG. 2. A protein consists of an amino acid 
sequence when the amino acid sequence is the final amino 
acid sequence of the protein. 

The present invention further provides proteins that consist
essentially of the amino acid sequences provided in FIG. 
2 (SEQ ID NO:2), for example, proteins encoded by the 
transcript/cDNA nucleic acid sequences shown in FIG. 1 
(SEQ ID NO:1) and the genomic sequences provided in FIG. 
3 (SEQ ID NO:3). A protein consists essentially of an amino 
acid sequence when such an amino acid sequence is present 
with only a few additional amino acid residues, for example 
from about 1 to about 100 or so additional residues, typically 
from 1 to about 20 additional residues in the final protein. 

The present invention further provides proteins that comprise
the amino acid sequences provided in FIG. 2 (SEQ ID 
NO:2), for example, proteins encoded by the transcript/ 
cDNA nucleic acid sequences shown in FIG. 1 (SEQ ID 
NO:1) and the genomic sequences provided in FIG. 3 (SEQ 
ID NO:3). A protein comprises an amino acid sequence 
when the amino acid sequence is at least part of the final 
amino acid sequence of the protein. In such a fashion, the 
protein can be only the peptide or have additional amino acid 
molecules, such as amino acid residues (contiguous encoded 
sequence) that are naturally associated with it or heterologous
amino acid residues/peptide sequences. Such a protein 
can have a few additional amino acid residues or can 
comprise several hundred or more additional amino acids. 
The preferred classes of proteins that are comprised of the 
kinase peptides of the present invention are the naturally 
occurring mature proteins. A brief description of how various
types of these proteins can be made/isolated is provided 
below. 

The kinase peptides of the present invention can be 
attached to heterologous sequences to form chimeric or 
fusion proteins. Such chimeric and fusion proteins comprise 
a kinase peptide operatively linked to a heterologous protein 
having an amino acid sequence not substantially homologous
to the kinase peptide. "Operatively linked" indicates 
that the kinase peptide and the heterologous protein are 
fused in-frame. The heterologous protein can be fused to the 
N-terminus or C-terminus of the kinase peptide. 

In some uses, the fusion protein does not affect the 
activity of the kinase peptide per se. For example, the fusion 
protein can include, but is not limited to, enzymatic fusion 
proteins, for example beta-galactosidase fusions, yeast
two-hybrid GAL fusions, poly-His fusions, MYC-tagged, 
HI-tagged and Ig fusions. Such fusion proteins, particularly 
poly-His fusions, can facilitate the purification of recombinant
kinase peptide. In certain host cells (e.g., mammalian 
host cells), expression and/or secretion of a protein can be 
increased by using a heterologous signal sequence. 

A chimeric or fusion protein can be produced by standard 
recombinant DNA techniques. For example, DNA fragments 
coding for the different protein sequences are ligated 
together in-frame in accordance with conventional
techniques. In another embodiment, the fusion gene can be 
synthesized by conventional techniques including automated
DNA synthesizers. Alternatively, PCR amplification 
of gene fragments can be carried out using anchor primers 
which give rise to complementary overhangs between two 
consecutive gene fragments which can subsequently be 
annealed and re-amplified to generate a chimeric gene 
sequence (see Ausubel et al., Current Protocols in Molecular
Biology, 1992). Moreover, many expression vectors are 
commercially available that already encode a fusion moiety 
(e.g., a GST protein). A kinase peptide-encoding nucleic 
acid can be cloned into such an expression vector such that 
the fusion moiety is linked in-frame to the kinase peptide. 

As mentioned above, the present invention also provides 
and enables obvious variants of the amino acid sequence of 
the proteins of the present invention, such as naturally 
occurring mature forms of the peptide, allelic/sequence 
variants of the peptides, non-naturally occurring recombinantly
derived variants of the peptides, and orthologs and 
paralogs of the peptides. Such variants can readily be 
generated using art-known techniques in the fields of
recombinant nucleic acid technology and protein biochemistry. It 
is understood, however, that variants exclude any amino acid 
sequences disclosed prior to the invention. 

Such variants can readily be identified/made using 
molecular techniques and the sequence information
disclosed herein. Further, such variants can readily be
distinguished from other peptides based on sequence and/or 
structural homology to the kinase peptides of the present 
invention. The degree of homology/identity present will be 
based primarily on whether the peptide is a functional 
variant or non-functional variant, the amount of divergence 
present in the paralog family and the evolutionary distance 
between the orthologs. 

To determine the percent identity of two amino acid 
sequences or two nucleic acid sequences, the sequences are 
aligned for optimal comparison purposes (e.g., gaps can be 
introduced in one or both of a first and a second amino acid 
or nucleic acid sequence for optimal alignment and
non-homologous sequences can be disregarded for comparison 
purposes). In a preferred embodiment, at least 30%, 40%, 
50%, 60%, 70%, 80%, or 90% or more of the length of a 
reference sequence is aligned for comparison purposes. The 
amino acid residues or nucleotides at corresponding amino 
acid positions or nucleotide positions are then compared. 
When a position in the first sequence is occupied by the 
same amino acid residue or nucleotide as the corresponding 
position in the second sequence, then the molecules are 
identical at that position (as used herein amino acid or 
nucleic acid "identity" is equivalent to amino acid or nucleic 
acid "homology"). The percent identity between the two 
sequences is a function of the number of identical positions 
shared by the sequences, taking into account the number of 
gaps, and the length of each gap, which need to be introduced
for optimal alignment of the two sequences. 

The comparison of sequences and determination of percent
identity and similarity between two sequences can be 
accomplished using a mathematical algorithm. 
(Computational Molecular Biology, Lesk, A. M., ed., 
Oxford University Press, New York, 1988; Biocomputing: 
Informatics and Genome Projects, Smith, D. W., ed.,
Academic Press, New York, 1993; Computer Analysis of 
Sequence Data, Part 1, Griffin, A. M., and Griffin, H. G., 
eds., Humana Press, New Jersey, 1994; Sequence Analysis 
in Molecular Biology, von Heijne, G., Academic Press, 
1987; and Sequence Analysis Primer, Gribskov, M. and 
Devereux, J., eds., M Stockton Press, New York, 1991). In 
a preferred embodiment, the percent identity between two 
amino acid sequences is determined using the Needleman 
and Wunsch (J. Mol. Biol. 48:444-453 (1970)) algorithm 
which has been incorporated into the GAP program in the 
GCG software package (available at http://www.gcg.com), 
using either a BLOSUM 62 matrix or a PAM250 matrix, and 
a gap weight of 16, 14, 12, 10, 8, 6, or 4 and a length weight 
of 1, 2, 3, 4, 5, or 6. In yet another preferred embodiment, 
the percent identity between two nucleotide sequences is 
determined using the GAP program in the GCG software 
package (Devereux, J., et al., Nucleic Acids Res. 12(1):387 
(1984)) (available at http://www.gcg.com), using a
NWSgapdna.CMP matrix and a gap weight of 40, 50, 60, 70, or 
80 and a length weight of 1, 2, 3, 4, 5, or 6. In another 
embodiment, the percent identity between two amino acid or 
nucleotide sequences is determined using the algorithm of E. 
Myers and W. Miller (CABIOS, 4:11-17 (1989)) which has 
been incorporated into the ALIGN program (version 2.0), 
using a PAM120 weight residue table, a gap length penalty 
of 12 and a gap penalty of 4. 
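
The percent identity calculation described above can be illustrated with a minimal sketch. It is not the GAP program, the GCG package, or ALIGN: it assumes a plain match/mismatch score and a single linear gap penalty in place of the BLOSUM 62 or PAM250 matrices and the separate gap and length weights named in the text. The function names and default scores are hypothetical, and percent identity is taken here as identical positions divided by the full alignment length (gap columns included), which is only one of several conventions in use.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences (illustrative scoring, linear gap penalty)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical non-gap columns as a percentage of all alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    top, bottom = needleman_wunsch("ACT", "ACGT")
    print(top, bottom, percent_identity(top, bottom))
```

Because each gap column lengthens the alignment without adding an identical position, both the number of gaps and their lengths lower the reported percentage, which matches the dependence on gaps described in the text.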

The nucleic acid and protein sequences of the present 
invention can further be used as a "query sequence" to 
perform a search against sequence databases to, for example, 
identify other family members or related sequences. Such 
searches can be performed using the NBLAST and 
XBLAST programs (version 2.0) of Altschul, et al. (J. Mol. 
Biol. 215:403-10 (1990)). BLAST nucleotide searches can 
be performed with the NBLAST program, score=100, 
wordlength=12 to obtain nucleotide sequences homologous 
to the nucleic acid molecules of the invention. BLAST 
protein searches can be performed with the XBLAST 
program, score=50, wordlength=3 to obtain amino acid 
sequences homologous to the proteins of the invention. To 
obtain gapped alignments for comparison purposes, Gapped 
BLAST can be utilized as described in Altschul et al. 
(Nucleic Acids Res. 25(17):3389-3402 (1997)). When
utilizing BLAST and gapped BLAST programs, the default 
parameters of the respective programs (e.g., XBLAST and 
NBLAST) can be used. 

Full-length pre-processed forms, as well as mature pro- 
cessed forms, of proteins that comprise one of the peptides 
of the present invention can readily be identified as having 
complete sequence identity to one of the kinase peptides of 
the present invention as well as being encoded by the same 
genetic locus as the kinase peptide provided herein. The 
gene encoding the novel kinase protein of the present 
invention is located on a genome component that has been 
mapped to human chromosome 22 (as indicated in FIG. 3), 
which is supported by multiple lines of evidence, such as 
STS and BAC map data. 

Allelic variants of a kinase peptide can readily be iden- 
tified as being a human protein having a high degree 
(significant) of sequence homology/identity to at least a 
portion of the kinase peptide as well as being encoded by the 
same genetic locus as the kinase peptide provided herein. 
Genetic locus can readily be determined based on the 
genomic information provided in FIG. 3, such as the 
genomic sequence mapped to the reference human. The gene 
encoding the novel kinase protein of the present invention is 
located on a genome component that has been mapped to 
human chromosome 22 (as indicated in FIG. 3), which is 
supported by multiple lines of evidence, such as STS and 
BAC map data. As used herein, two proteins (or a region of 
the proteins) have significant homology when the amino 
acid sequences are typically at least about 70-80%, 80-90%, 
and more typically at least about 90-95% or more homolo- 
gous. A significantly homologous amino acid sequence, 
according to the present invention, will be encoded by a 
nucleic acid sequence that will hybridize to a kinase peptide 
encoding nucleic acid molecule under stringent conditions 
as more fully described below. 

FIG. 3 provides information on SNPs that have been 
found in the gene encoding the kinase protein of the present 
invention. SNPs were identified at 42 different nucleotide 
positions. Some of these SNPs, which are located outside the 
ORF and in introns, may affect gene transcription. 

Paralogs of a kinase peptide can readily be identified as 
having some degree of significant sequence homology/ 
identity to at least a portion of the kinase peptide, as being 
encoded by a gene from humans, and as having similar 
activity or function. Two proteins will typically be consid- 
ered paralogs when the amino acid sequences are typically 
at least about 60% or greater, and more typically at least 
about 70% or greater homology through a given region or 
domain. Such paralogs will be encoded by a nucleic acid 
sequence that will hybridize to a kinase peptide encoding 
nucleic acid molecule under moderate to stringent condi- 
tions as more fully described below. 

Orthologs of a kinase peptide can readily be identified as 
having some degree of significant sequence homology/ 
identity to at least a portion of the kinase peptide as well as 
being encoded by a gene from another organism. Preferred 
orthologs will be isolated from mammals, preferably 
primates, for the development of human therapeutic targets 
and agents. Such orthologs will be encoded by a nucleic acid 
sequence that will hybridize to a kinase peptide encoding 
nucleic acid molecule under moderate to stringent 
conditions, as more fully described below, depending on the 
degree of relatedness of the two organisms yielding the 
proteins. 

Non-naturally occurring variants of the kinase peptides of 
the present invention can readily be generated using recom- 
binant techniques. Such variants include, but are not limited 
to deletions, additions and substitutions in the amino acid 
sequence of the kinase peptide. For example, one class of 
substitutions is conserved amino acid substitutions. Such 
substitutions are those that substitute a given amino acid in 
a kinase peptide by another amino acid of like
characteristics. Typically seen as conservative substitutions are the 
replacements, one for another, among the aliphatic amino 
acids Ala, Val, Leu, and Ile; interchange of the hydroxyl 
residues Ser and Thr; exchange of the acidic residues Asp 
and Glu; substitution between the amide residues Asn and 
Gln; exchange of the basic residues Lys and Arg; and 
replacements among the aromatic residues Phe and Tyr. 
Guidance concerning which amino acid changes are likely to 
be phenotypically silent is found in Bowie et al., Science 
247:1306-1310 (1990). 
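
The conservative substitution classes listed above can be captured in a small lookup, sketched here for illustration only. The group labels, function names, and the choice to compare two equal-length, ungapped sequences position by position are assumptions of this sketch; the six residue groups themselves are the ones named in the text, written in one-letter code.

```python
# Residue groups taken from the text (one-letter amino acid codes).
CONSERVATIVE_GROUPS = {
    "aliphatic": frozenset("AVLI"),  # Ala, Val, Leu, Ile
    "hydroxyl": frozenset("ST"),     # Ser, Thr
    "acidic": frozenset("DE"),       # Asp, Glu
    "amide": frozenset("NQ"),        # Asn, Gln
    "basic": frozenset("KR"),        # Lys, Arg
    "aromatic": frozenset("FY"),     # Phe, Tyr
}


def is_conservative(original, replacement):
    """True if both residues differ and fall in the same listed group."""
    o, r = original.upper(), replacement.upper()
    if o == r:
        return False  # identical residues are not a substitution
    return any(o in group and r in group
               for group in CONSERVATIVE_GROUPS.values())


def classify_substitutions(reference, variant):
    """List (position, ref, var, conservative) for each differing position.

    Positions are 1-based. Sequences must be the same length (no indels).
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [(pos, ref, var, is_conservative(ref, var))
            for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
            if ref != var]


if __name__ == "__main__":
    print(classify_substitutions("MKSL", "MRSP"))
```

A substitution flagged as conservative by this lookup is only a candidate for being phenotypically silent; whether function is retained still depends on the position of the residue, as discussed below.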

Variant kinase peptides can be fully functional or can lack 
function in one or more activities, e.g. ability to bind 
substrate, ability to phosphorylate substrate, ability to medi- 
ate signaling, etc. Fully functional variants typically contain 
only conservative variation or variation in non-critical resi- 
dues or in non-critical regions. FIG. 2 provides the result of 
protein analysis and can be used to identify critical domains/ 
regions. Functional variants can also contain substitution of 
similar amino acids that result in no change or an insignifi- 
cant change in function. Alternatively, such substitutions 
may positively or negatively affect function to some degree. 
Non-functional variants typically contain one or more 
non-conservative amino acid substitutions, deletions, 
insertions, inversions, or truncation or a substitution, 
insertion, inversion, or deletion in a critical residue or 
critical region. 

Amino acids that are essential for function can be
identified by methods known in the art, such as site-directed 
mutagenesis or alanine-scanning mutagenesis (Cunningham 
et al., Science 244:1081-1085 (1989)), particularly using the 
results provided in FIG. 2. The latter procedure introduces 
single alanine mutations at every residue in the molecule. 
The resulting mutant molecules are then tested for biological 
activity such as kinase activity or in assays such as an in 
vitro proliferative activity. Sites that are critical for binding 
partner/substrate binding can also be determined by
structural analysis such as crystallization, nuclear magnetic
resonance or photoaffinity labeling (Smith et al., J. Mol. Biol. 
224:899-904 (1992); de Vos et al. Science 255:306-312 
(1992)). 
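
The set of mutants that an alanine scan calls for can be enumerated in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, the input is a one-letter protein sequence, positions that are already alanine are skipped, and the mutant labels follow the common residue-position-replacement style (for example K3A). It produces sequences only; testing each mutant for activity remains an experimental step.

```python
def alanine_scan(sequence):
    """Return (label, mutant_sequence) for every single-alanine mutant.

    Positions are 1-based. Residues that are already alanine are skipped
    because replacing them would not change the sequence.
    """
    mutants = []
    for index, residue in enumerate(sequence):
        if residue == "A":
            continue
        mutant = sequence[:index] + "A" + sequence[index + 1:]
        mutants.append((f"{residue}{index + 1}A", mutant))
    return mutants


if __name__ == "__main__":
    for label, mutant in alanine_scan("MAK"):
        print(label, mutant)
```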

The present invention further provides fragments of the 
kinase peptides, in addition to proteins and peptides that 
comprise and consist of such fragments, particularly those 
comprising the residues identified in FIG. 2. The fragments 
to which the invention pertains, however, are not to be 
construed as encompassing fragments that may be disclosed 
publicly prior to the present invention. 

As used herein, a fragment comprises at least 8, 10, 12, 
14, 16, or more contiguous amino acid residues from a 
kinase peptide. Such fragments can be chosen based on the 
ability to retain one or more of the biological activities of the 
kinase peptide or could be chosen for the ability to perform 
a function, e.g. bind a substrate or act as an immunogen. 
Particularly important fragments are biologically active 
fragments, peptides that are, for example, about 8 or more 
amino acids in length. Such fragments will typically
comprise a domain or motif of the kinase peptide, e.g., active 
site, a transmembrane domain or a substrate-binding 
domain. Further, possible fragments include, but are not 
limited to, domain or motif containing fragments, soluble 
peptide fragments, and fragments containing immunogenic 
structures. Predicted domains and functional sites are readily 
identifiable by computer programs well known and readily 
available to those of skill in the art (e.g., PROSITE analysis). 
The results of one such analysis are provided in FIG. 2. 
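
The length criterion for fragments given above (at least 8 contiguous residues) can be illustrated with a sliding window. This is a minimal sketch: the function names and the example sequence are hypothetical, and the code addresses only contiguity and length, not biological activity or prior disclosure.

```python
def contiguous_fragments(sequence, length=8):
    """Return every contiguous fragment of exactly `length` residues."""
    if length < 1:
        raise ValueError("length must be positive")
    # The range is empty when the sequence is shorter than `length`.
    return [sequence[i:i + length]
            for i in range(len(sequence) - length + 1)]


def is_fragment_of(candidate, sequence, min_length=8):
    """True if `candidate` is a contiguous stretch of `sequence`
    that meets the minimum length."""
    return len(candidate) >= min_length and candidate in sequence


if __name__ == "__main__":
    example = "ABCDEFGHIJ"  # placeholder letters, not a real peptide
    print(contiguous_fragments(example, 8))
```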

Polypeptides often contain amino acids other than the 20 
amino acids commonly referred to as the 20 naturally 
occurring amino acids. Further, many amino acids, including 
the terminal amino acids, may be modified by natural 
processes, such as processing and other post-translational 
modifications, or by chemical modification techniques well 
known in the art. Common modifications that occur
naturally in kinase peptides are described in basic texts, detailed 
monographs, and the research literature, and they are well 
known to those of skill in the art (some of these features are 
identified in FIG. 2). 

Known modifications include, but are not limited to, 
acetylation, acylation, ADP-ribosylation, amidation,
covalent attachment of flavin, covalent attachment of a heme 
moiety, covalent attachment of a nucleotide or nucleotide 
derivative, covalent attachment of a lipid or lipid derivative, 
covalent attachment of phosphatidylinositol, cross-linking, 
cyclization, disulfide bond formation, demethylation,
formation of covalent crosslinks, formation of cystine,
formation of pyroglutamate, formylation, gamma carboxylation, 
glycosylation, GPI anchor formation, hydroxylation, 
iodination, methylation, myristoylation, oxidation,
proteolytic processing, phosphorylation, prenylation, 
racemization, selenoylation, sulfation, transfer-RNA
mediated addition of amino acids to proteins such as arginylation, 
and ubiquitination. 

Such modifications are well known to those of skill in the 
art and have been described in great detail in the scientific 
literature. Several particularly common modifications, 
glycosylation, lipid attachment, sulfation, gamma- 
carboxylation of glutamic acid residues, hydroxylation and 
ADP-ribosylation, for instance, are described in most basic 
texts, such as Proteins — Structure and Molecular 
Properties, 2nd Ed., T. E. Creighton, W. H. Freeman and 
Company, New York (1993). Many detailed reviews are 
available on this subject, such as by Wold, F., Posttransla- 
tional Covalent Modification of Proteins, B. C. Johnson, 
Ed., Academic Press, New York 1-12 (1983); Seifter et al. 
(Meth. Enzymol. 182:626-646 (1990)) and Rattan et al. 
(Ann. N.Y. Acad. Sci. 663:48-62 (1992)). 

Accordingly, the kinase peptides of the present invention 
also encompass derivatives or analogs in which a substituted 
amino acid residue is not one encoded by the genetic code, 
in which a substituent group is included, in which the mature 
kinase peptide is fused with another compound, such as a 
compound to increase the half-life of the kinase peptide (for 
example, polyethylene glycol), or in which the additional 
amino acids are fused to the mature kinase peptide, such as 
a leader or secretory sequence or a sequence for purification 
of the mature kinase peptide or a pro-protein sequence. 

Protein/Peptide Uses 

The proteins of the present invention can be used in 
substantial and specific assays related to the functional 
information provided in the Figures; to raise antibodies or to 
elicit another immune response; as a reagent (including the 
labeled reagent) in assays designed to quantitatively deter- 
mine levels of the protein (or its binding partner or ligand) 
in biological fluids; and as markers for tissues in which the 
corresponding protein is preferentially expressed (either 
constitutively or at a particular stage of tissue differentiation 
or development or in a disease state). Where the protein 
binds or potentially binds to another protein or ligand (such 
as, for example, in a kinase-effector protein interaction or 
kinase-ligand interaction), the protein can be used to identify 
the binding partner/ligand so as to develop a system to 
identify inhibitors of the binding interaction. Any or all of 
these uses are capable of being developed into reagent grade 
or kit format for commercialization as commercial products. 

Methods for performing the uses listed above are well 
known to those skilled in the art. References disclosing such 
methods include "Molecular Cloning: A Laboratory 
Manual", 2d ed., Cold Spring Harbor Laboratory Press, 
Sambrook, J., E. F. Fritsch and T. Maniatis eds., 1989, and 
"Methods in Enzymology: Guide to Molecular Cloning 
Techniques", Academic Press, Berger, S. L. and A. R. 
Kimmel eds., 1987. 

The potential uses of the peptides of the present invention 
are based primarily on the source of the protein as well as the 
class/action of the protein. For example, kinases isolated 
from humans and their human/mammalian orthologs serve 
as targets for identifying agents for use in mammalian 
therapeutic applications, e.g. a human drug, particularly in 
modulating a biological or pathological response in a cell or 
tissue that expresses the kinase. Experimental data as pro- 
vided in FIG. 1 indicates that the kinase proteins of the 
present invention are expressed in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
brain, and thyroid gland, as indicated by virtual northern blot 
analysis. In addition, PCR-based tissue screening panels 
indicate expression in fetal brain. A large percentage of 
pharmaceutical agents are being developed that modulate 
the activity of kinase proteins, particularly members of the 
serine/threonine kinase subfamily (see Background of the 
Invention). The structural and functional information pro- 
vided in the Background and Figures provide specific and 
substantial uses for the molecules of the present invention, 
particularly in combination with the expression information 
provided in FIG. 1. Experimental data as provided in FIG. 
1 indicates expression in humans in teratocarcinoma, ovary, 
testis, nervous tissue, bladder, infant and fetal brain, and 
thyroid gland. Such uses can readily be determined using the 
information provided herein, that which is known in the art, 
and routine experimentation. 

The proteins of the present invention (including variants 
and fragments that may have been disclosed prior to the 
present invention) are useful for biological assays related to 
kinases that are related to members of the serine/threonine 
kinase subfamily. Such assays involve any of the known 
kinase functions or activities or properties useful for
diagnosis and treatment of kinase-related conditions that are 
specific for the subfamily of kinases to which the kinase of the 
present invention belongs, particularly in cells and tissues 
that express the kinase. Experimental data as provided in 
FIG. 1 indicates that the kinase proteins of the present 
invention are expressed in humans in teratocarcinoma, 
ovary, testis, nervous tissue, bladder, infant brain, and
thyroid gland, as indicated by virtual northern blot analysis. In 
addition, PCR-based tissue screening panels indicate
expression in fetal brain. 

The proteins of the present invention are also useful in 
drug screening assays, in cell-based or cell-free systems. 
Cell-based systems can be native, i.e., cells that normally 
express the kinase, as a biopsy or expanded in cell culture. 
Experimental data as provided in FIG. 1 indicates
expression in humans in teratocarcinoma, ovary, testis, nervous 
tissue, bladder, infant and fetal brain, and thyroid gland. In 
an alternate embodiment, cell-based assays involve
recombinant host cells expressing the kinase protein. 

The polypeptides can be used to identify compounds that 
modulate kinase activity of the protein in its natural state or 
an altered form that causes a specific disease or pathology 
associated with the kinase. Both the kinases of the present 
invention and appropriate variants and fragments can be 
used in high-throughput screens to assay candidate
compounds for the ability to bind to the kinase. These
compounds can be further screened against a functional kinase to 
determine the effect of the compound on the kinase activity. 
Further, these compounds can be tested in animal or
invertebrate systems to determine activity/effectiveness.
Compounds can be identified that activate (agonist) or inactivate 
(antagonist) the kinase to a desired degree. 

Further, the proteins of the present invention can be used 55 
to screen a compound for the ability to stimulate or inhibit 
interaction between the kinase protein and a molecule that 
normally interacts with the kinase protein, e.g. a substrate or 
a component of the signal pathway with which the kinase protein
normally interacts (for example, another kinase). Such
assays typically include the steps of combining the kinase 
protein with a candidate compound under conditions that 
allow the kinase protein, or fragment, to interact with the 
target molecule, and to detect the formation of a complex 
between the protein and the target or to detect the biochemical
consequence of the interaction with the kinase protein
and the target, such as any of the associated effects of signal
transduction such as protein phosphorylation, cAMP
turnover, and adenylate cyclase activation, etc.

Candidate compounds include, for example, 1) peptides 
such as soluble peptides, including Ig-tailed fusion peptides 
and members of random peptide libraries (see, e.g., Lam et 
al., Nature 354:82-84 (1991); Houghten et al., Nature
354:84-86 (1991)) and combinatorial chemistry-derived 
molecular libraries made of D- and/or L-configuration 
amino acids; 2) phosphopeptides (e.g., members of random 
and partially degenerate, directed phosphopeptide libraries, 
see, e.g., Songyang et al., Cell 72:767-778 (1993)); 3)
antibodies (e.g., polyclonal, monoclonal, humanized, anti- 
idiotypic, chimeric, and single chain antibodies as well as 
Fab, F(ab')2, Fab expression library fragments, and epitope-binding
fragments of antibodies); and 4) small organic and
inorganic molecules (e.g., molecules obtained from combi- 
natorial and natural product libraries). 

One candidate compound is a soluble fragment of the 
receptor that competes for substrate binding. Other candi- 
date compounds include mutant kinases or appropriate frag- 
ments containing mutations that affect kinase function and 
thus compete for substrate. Accordingly, a fragment that 
competes for substrate, for example with a higher affinity, or 
a fragment that binds substrate but does not allow release, is 
encompassed by the invention. 

The invention further includes other end point assays to 
identify compounds that modulate (stimulate or inhibit) 
kinase activity. The assays typically involve an assay of 
events in the signal transduction pathway that indicate 
kinase activity. Thus, the phosphorylation of a substrate, 
activation of a protein, or a change in the expression of genes
that are up- or down-regulated in response to the kinase
protein-dependent signal cascade can be assayed.

Any of the biological or biochemical functions mediated 
by the kinase can be used as an endpoint assay. These 
include all of the biochemical or biochemical/biological 
events described herein, in the references cited herein, 
incorporated by reference for these endpoint assay targets, 
and other functions known to those of ordinary skill in the 
art or that can be readily identified using the information 
provided in the Figures, particularly FIG. 2. Specifically, a 
biological function of a cell or tissues that expresses the 
kinase can be assayed. Experimental data as provided in 
FIG. 1 indicates that the kinase proteins of the present 
invention are expressed in humans in teratocarcinoma, 
ovary, testis, nervous tissue, bladder, infant brain, and thy- 
roid gland, as indicated by virtual northern blot analysis. In 
addition, PCR-based tissue screening panels indicate expres- 
sion in fetal brain. 

Binding and/or activating compounds can also be 
screened by using chimeric kinase proteins in which the 
amino terminal extracellular domain, or parts thereof, the 
entire transmembrane domain or subregions, such as any of 
the seven transmembrane segments or any of the intracel- 
lular or extracellular loops and the carboxy terminal intra- 
cellular domain, or parts thereof, can be replaced by heter- 
ologous domains or subregions. For example, a substrate- 
binding region can be used that interacts with a different 
substrate than that which is recognized by the native kinase.
Accordingly, a different set of signal transduction compo- 
nents is available as an end-point assay for activation. This 
allows for assays to be performed in other than the specific 
host cell from which the kinase is derived. 

The proteins of the present invention are also useful in 
competition binding assays in methods designed to discover 
compounds that interact with the kinase (e.g., binding partners
and/or ligands). Thus, a compound is exposed to a
kinase polypeptide under conditions that allow the com- 
pound to bind or to otherwise interact with the polypeptide. 
Soluble kinase polypeptide is also added to the mixture. If 
the test compound interacts with the soluble kinase 
polypeptide, it decreases the amount of complex formed or 
activity from the kinase target. This type of assay is par- 
ticularly useful in cases in which compounds are sought that 
interact with specific regions of the kinase. Thus, the soluble 
polypeptide that competes with the target kinase region is 
designed to contain peptide sequences corresponding to the 
region of interest. 
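
The readout of such a competition assay can be expressed as the
relative decrease in complex formation (or activity) when the
soluble polypeptide is present. The following Python sketch is
illustrative only: the function name, the signal units, and the
example values are assumptions and are not taken from the
specification.

```python
def percent_reduction(signal_without_competitor, signal_with_competitor):
    """Percent decrease in complex signal caused by the soluble competitor.

    Both arguments are hypothetical assay readouts (for example,
    bound radiolabel counts) measured under otherwise identical
    conditions.
    """
    if signal_without_competitor <= 0:
        raise ValueError("baseline signal must be positive")
    difference = signal_without_competitor - signal_with_competitor
    return 100.0 * difference / signal_without_competitor


if __name__ == "__main__":
    # Hypothetical readouts: 200 units without and 50 units with competitor.
    print(percent_reduction(200.0, 50.0))
```

A value near zero would suggest that the test compound does not
interact with the region represented by the soluble polypeptide;
any threshold for calling a "hit" would have to be set
empirically.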

To perform cell free drug screening assays, it is some- 
times desirable to immobilize either the kinase protein, or 
fragment, or its target molecule to facilitate separation of 
complexes from uncomplexed forms of one or both of the 
proteins, as well as to accommodate automation of the assay. 

Techniques for immobilizing proteins on matrices can be 
used in the drug screening assays. In one embodiment, a 
fusion protein can be provided which adds a domain that 
allows the protein to be bound to a matrix. For example, 
glutathione-S-transferase fusion proteins can be adsorbed 
onto glutathione sepharose beads (Sigma Chemical, St. 
Louis, Mo.) or glutathione derivatized microtitre plates,
which are then combined with the cell lysates (e.g.,
35S-labeled) and the candidate compound, and the mixture
incubated under conditions conducive to complex formation
(e.g., at physiological conditions for salt and pH). Following
incubation, the beads are washed to remove any unbound
label, and the matrix immobilized and radiolabel determined
directly, or in the supernatant after the complexes are
dissociated. Alternatively, the complexes can be dissociated
from the matrix, separated by SDS-PAGE, and the level of
kinase-binding protein found in the bead fraction quantitated
from the gel using standard electrophoretic techniques. For
example, either the polypeptide or its target molecule can be
immobilized utilizing conjugation of biotin and streptavidin
using techniques well known in the art. Alternatively,
antibodies reactive with the protein but which do not interfere
with binding of the protein to its target molecule can be
derivatized to the wells of the plate, and the protein trapped
in the wells by antibody conjugation. Preparations of a
kinase-binding protein and a candidate compound are
incubated in the kinase protein-presenting wells and the amount
of complex trapped in the well can be quantitated. Methods
for detecting such complexes, in addition to those described
above for the GST-immobilized complexes, include
immunodetection of complexes using antibodies reactive with the
kinase protein target molecule, or which are reactive with
kinase protein and compete with the target molecule, as well
as enzyme-linked assays, which rely on detecting an
enzymatic activity associated with the target molecule.

Agents that modulate one of the kinases of the present
invention can be identified using one or more of the above
assays, alone or in combination. It is generally preferable to
use a cell-based or cell free system first and then confirm
activity in an animal or other model system. Such model
systems are well known in the art and can readily be
employed in this context.

Modulators of kinase protein activity identified according
to these drug screening assays can be used to treat a subject
with a disorder mediated by the kinase pathway, by treating
cells or tissues that express the kinase. Experimental data as
provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
and fetal brain, and thyroid gland. These methods of
treatment include the steps of administering a modulator of
kinase activity in a pharmaceutical composition to a subject
in need of such treatment, the modulator being identified as
described herein.

In yet another aspect of the invention, the kinase proteins
can be used as "bait proteins" in a two-hybrid assay or
three-hybrid assay (see, e.g., U.S. Pat. No. 5,283,317;
Zervos et al. (1993) Cell 72:223-232; Madura et al. (1993) J.
Biol. Chem. 268:12046-12054; Bartel et al. (1993)
Biotechniques 14:920-924; Iwabuchi et al. (1993) Oncogene
8:1693-1696; and Brent WO94/10300), to identify other
proteins, which bind to or interact with the kinase and are
involved in kinase activity. Such kinase-binding proteins are
also likely to be involved in the propagation of signals by the
kinase proteins or kinase targets as, for example,
downstream elements of a kinase-mediated signaling pathway.
Alternatively, such kinase-binding proteins are likely to be
kinase inhibitors.

The two-hybrid system is based on the modular nature of
most transcription factors, which consist of separable
DNA-binding and activation domains. Briefly, the assay utilizes
two different DNA constructs. In one construct, the gene that
codes for a kinase protein is fused to a gene encoding the
DNA binding domain of a known transcription factor (e.g.,
GAL-4). In the other construct, a DNA sequence, from a
library of DNA sequences, that encodes an unidentified
protein ("prey" or "sample") is fused to a gene that codes for
the activation domain of the known transcription factor. If
the "bait" and the "prey" proteins are able to interact, in
vivo, forming a kinase-dependent complex, the DNA-binding
and activation domains of the transcription factor
are brought into close proximity. This proximity allows
transcription of a reporter gene (e.g., LacZ) which is operably
linked to a transcriptional regulatory site responsive to
the transcription factor. Expression of the reporter gene can
be detected and cell colonies containing the functional
transcription factor can be isolated and used to obtain the
cloned gene which encodes the protein which interacts with
the kinase protein.

This invention further pertains to novel agents identified
by the above-described screening assays. Accordingly, it is
within the scope of this invention to further use an agent
identified as described herein in an appropriate animal
model. For example, an agent identified as described herein
(e.g., a kinase-modulating agent, an antisense kinase nucleic
acid molecule, a kinase-specific antibody, or a kinase-binding
partner) can be used in an animal or other model to
determine the efficacy, toxicity, or side effects of treatment
with such an agent. Alternatively, an agent identified as
described herein can be used in an animal or other model to
determine the mechanism of action of such an agent.
Furthermore, this invention pertains to uses of novel agents
identified by the above-described screening assays for
treatments as described herein.

The kinase proteins of the present invention are also
useful to provide a target for diagnosing a disease or
predisposition to disease mediated by the peptide.
Accordingly, the invention provides methods for detecting
the presence, or levels of, the protein (or encoding mRNA)
in a cell, tissue, or organism. Experimental data as provided
in FIG. 1 indicates expression in humans in teratocarcinoma,
ovary, testis, nervous tissue, bladder, infant and fetal brain,
and thyroid gland. The method involves contacting a
biological sample with a compound capable of interacting with
the kinase protein such that the interaction can be detected.
Such an assay can be provided in a single detection format
or a multi-detection format such as an antibody chip array.

One agent for detecting a protein in a sample is an
antibody capable of selectively binding to protein. A
biological sample includes tissues, cells and biological fluids
isolated from a subject, as well as tissues, cells and fluids
present within a subject.

The peptides of the present invention also provide targets 
for diagnosing active protein activity, disease, or predisposition
to disease, in a patient having a variant peptide,
particularly activities and conditions that are known for 
other members of the family of proteins to which the present 
one belongs. Thus, the peptide can be isolated from a 
biological sample and assayed for the presence of a genetic
mutation that results in aberrant peptide. This includes
amino acid substitution, deletion, insertion, rearrangement
(as the result of aberrant splicing events), and inappropriate
post-translational modification. Analytic methods include 
altered electrophoretic mobility, altered tryptic peptide
digest, altered kinase activity in cell-based or cell-free assay, 
alteration in substrate or antibody-binding pattern, altered 
isoelectric point, direct amino acid sequencing, and any 
other of the known assay techniques useful for detecting 
mutations in a protein. Such an assay can be provided in a
single detection format or a multi-detection format such as 
an antibody chip array. 
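
Where direct amino acid sequencing is used, the observed peptide
sequence can be compared with a reference sequence to flag
substitutions, deletions, and insertions. The Python sketch
below is a minimal illustration using the standard-library
difflib module; the function name and the short example
sequences are hypothetical, and difflib is a general-purpose
text-comparison heuristic rather than a dedicated sequence
alignment method.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags onto the mutation classes named in the text.
_KINDS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def classify_differences(reference, observed):
    """Return (kind, reference_segment, observed_segment) for each difference."""
    matcher = SequenceMatcher(None, reference, observed, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((_KINDS[tag], reference[i1:i2], observed[j1:j2]))
    return differences


if __name__ == "__main__":
    reference = "MKTLLVAGSPEQ"  # hypothetical reference peptide
    for observed in ("MKTLLVAGTPEQ", "MKTLLVGSPEQ", "MKTLLVAGSWPEQ"):
        print(observed, classify_differences(reference, observed))
```

Rearrangements and post-translational modifications are not
visible to a plain sequence comparison of this kind and would
need the other analytic methods listed above.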

In vitro techniques for detection of peptide include 
enzyme linked immunosorbent assays (ELISAs), Western 
blots, immunoprecipitations and immunofluorescence using
a detection reagent, such as an antibody or protein binding 
agent. Alternatively, the peptide can be detected in vivo in a 
subject by introducing into the subject a labeled anti-peptide 
antibody or other types of detection agent. For example, the 
antibody can be labeled with a radioactive marker whose
presence and location in a subject can be detected by 
standard imaging techniques. Particularly useful are meth- 
ods that detect the allelic variant of a peptide expressed in a 
subject and methods which detect fragments of a peptide in 
a sample.

The peptides are also useful in pharmacogenomic analysis.
Pharmacogenomics deals with clinically significant
hereditary variations in the response to drugs due to altered 
drug disposition and abnormal action in affected persons. 
See, e.g., Eichelbaum, M. (Clin. Exp. Pharmacol. Physiol.
23(10-11):983-985 (1996)), and Linder, M. W. (Clin. Chem.
43(2):254-266 (1997)). The clinical outcomes of these
variations result in severe toxicity of therapeutic drugs in 
certain individuals or therapeutic failure of drugs in certain 
individuals as a result of individual variation in metabolism.
Thus, the genotype of the individual can determine the way 
a therapeutic compound acts on the body or the way the 
body metabolizes the compound. Further, the activity of 
drug metabolizing enzymes affects both the intensity and
duration of drug action. Thus, the pharmacogenomics of the
individual permit the selection of effective compounds and 
effective dosages of such compounds for prophylactic or 
therapeutic treatment based on the individual's genotype. 
The discovery of genetic polymorphisms in some drug 
metabolizing enzymes has explained why some patients do
not obtain the expected drug effects, show an exaggerated 
drug effect, or experience serious toxicity from standard 
drug dosages. Polymorphisms can be expressed in the phe- 
notype of the extensive metabolizer and the phenotype of the 
poor metabolizer. Accordingly, genetic polymorphism may
lead to allelic protein variants of the kinase protein in which 
one or more of the kinase functions in one population is 
different from those in another population. The peptides thus
provide a target to ascertain a genetic predisposition that can
affect treatment modality. Thus, in a ligand-based treatment,
polymorphism may give rise to amino terminal extracellular 
domains and/or other substrate-binding regions that are
more or less active in substrate binding, and kinase activation.
Accordingly, substrate dosage would necessarily be
modified to maximize the therapeutic effect within a given 
population containing a polymorphism. As an alternative to
genotyping, specific polymorphic peptides could be identi- 
fied. 

The peptides are also useful for treating a disorder char- 
acterized by an absence of, inappropriate, or unwanted 
expression of the protein. Experimental data as provided in 
FIG. 1 indicates expression in humans in teratocarcinoma, 
ovary, testis, nervous tissue, bladder, infant and fetal brain, 
and thyroid gland. Accordingly, methods for treatment 
include the use of the kinase protein or fragments. 

Antibodies 

The invention also provides antibodies that selectively 
bind to one of the peptides of the present invention, a protein 
comprising such a peptide, as well as variants and fragments 
thereof. As used herein, an antibody selectively binds a 
target peptide when it binds the target peptide and does not 
significantly bind to unrelated proteins. An antibody is still 
considered to selectively bind a peptide even if it also binds 
to other proteins that are not substantially homologous with 
the target peptide so long as such proteins share homology 
with a fragment or domain of the peptide target of the 
antibody. In this case, it would be understood that antibody 
binding to the peptide is still selective despite some degree 
of cross-reactivity. 

As used herein, an antibody is defined in terms consistent 
with that recognized within the art: they are multi-subunit 
proteins produced by a mammalian organism in response to 
an antigen challenge. The antibodies of the present invention 
include polyclonal antibodies and monoclonal antibodies, as 
well as fragments of such antibodies, including, but not 
limited to, Fab or F(ab')2, and Fv fragments. 

Many methods are known for generating and/or identify- 
ing antibodies to a given target peptide. Several such meth- 
ods are described by Harlow, Antibodies, Cold Spring 
Harbor Press, (1989). 

In general, to generate antibodies, an isolated peptide is 
used as an immunogen and is administered to a mammalian 
organism, such as a rat, rabbit or mouse. The full-length 
protein, an antigenic peptide fragment or a fusion protein 
can be used. Particularly important fragments are those 
covering functional domains, such as the domains identified 
in FIG. 2, and domains of sequence homology or divergence
amongst the family, such as those that can readily be 
identified using protein alignment methods and as presented 
in the Figures. 

Antibodies are preferably prepared from regions or dis- 
crete fragments of the kinase proteins. Antibodies can be 
prepared from any region of the peptide as described herein. 
However, preferred regions will include those involved in 
function/activity and/or kinase/binding partner interaction. 
FIG. 2 can be used to identify particularly important regions 
while sequence alignment can be used to identify conserved 
and unique sequence fragments. 

An antigenic fragment will typically comprise at least 8 
contiguous amino acid residues. The antigenic peptide can 
comprise, however, at least 10, 12, 14, 16 or more amino 
acid residues. Such fragments can be selected on a physical
property, such as fragments that correspond to regions that are
located on the surface of the protein, e.g., hydrophilic
regions, or can be selected based on sequence uniqueness
(see FIG. 2). 
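
One way to shortlist candidate antigenic fragments by
hydrophilicity can be sketched as follows. This Python example
scans a protein sequence with a window of 8 contiguous residues
(the minimum length stated above) and ranks the windows by mean
hydropathy. The choice of the Kyte-Doolittle scale, the function
name, and the example sequence are assumptions made for
illustration; the specification does not name a particular
scale or algorithm.

```python
# Kyte-Doolittle hydropathy values; more negative means more hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def rank_hydrophilic_windows(sequence, window=8):
    """Return (mean_hydropathy, start, fragment) tuples, most hydrophilic first."""
    if len(sequence) < window:
        return []
    scored = []
    for start in range(len(sequence) - window + 1):
        fragment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[residue] for residue in fragment) / window
        scored.append((mean, start, fragment))
    scored.sort()  # ascending mean: lowest (most hydrophilic) first
    return scored


if __name__ == "__main__":
    # Hypothetical sequence, not taken from FIG. 2.
    for mean, start, fragment in rank_hydrophilic_windows("MKTLLVAGSPEQDKRDEENST")[:3]:
        print(start, fragment, round(mean, 2))
```

Hydropathy is only a rough proxy for surface exposure, so any
shortlist produced this way would still need to be checked
against the structural information in FIG. 2 and against
sequence uniqueness.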

Detection of an antibody of the present invention can be
facilitated by coupling (i.e., physically linking) the antibody
to a detectable substance. Examples of detectable substances
include various enzymes, prosthetic groups, fluorescent 
materials, luminescent materials, bioluminescent materials, 
and radioactive materials. Examples of suitable enzymes 
include horseradish peroxidase, alkaline phosphatase,
β-galactosidase, or acetylcholinesterase; examples of suitable
prosthetic group complexes include streptavidin/biotin
and avidin/biotin; examples of suitable fluorescent materials 
include umbelliferone, fluorescein, fluorescein 
isothiocyanate, rhodamine, dichlorotriazinylamine
fluorescein, dansyl chloride or phycoerythrin; an example of 
a luminescent material includes luminol; examples of biolu- 
minescent materials include luciferase, luciferin, and 
aequorin, and examples of suitable radioactive material 
include 125I, 131I, 35S or 3H.

Antibody Uses 

The antibodies can be used to isolate one of the proteins 
of the present invention by standard techniques, such as 
affinity chromatography or immunoprecipitation. The anti- 
bodies can facilitate the purification of the natural protein 
from cells and recombinantly produced protein expressed in 
host cells. In addition, such antibodies are useful to detect 
the presence of one of the proteins of the present invention 
in cells or tissues to determine the pattern of expression of 
the protein among various tissues in an organism and over 
the course of normal development. Experimental data as 
provided in FIG. 1 indicates that the kinase proteins of the 
present invention are expressed in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
brain, and thyroid gland, as indicated by virtual northern blot 
analysis. In addition, PCR-based tissue screening panels 
indicate expression in fetal brain. Further, such antibodies 
can be used to detect protein in situ, in vitro, or in a cell 
lysate or supernatant in order to evaluate the abundance and 
pattern of expression. Also, such antibodies can be used to 
assess abnormal tissue distribution or abnormal expression 
during development or progression of a biological condition. 
Antibody detection of circulating fragments of the full 
length protein can be used to identify turnover. 

Further, the antibodies can be used to assess expression in
disease states such as in active stages of the disease or in an
individual with a predisposition toward disease related to the
protein's function. When a disorder is caused by an
inappropriate tissue distribution, developmental expression,
level of expression of the protein, or expressed/processed
form, the antibody can be prepared against the normal
protein. Experimental data as provided in FIG. 1 indicates
expression in humans in teratocarcinoma, ovary, testis,
nervous tissue, bladder, infant and fetal brain, and thyroid
gland. If a disorder is characterized by a specific mutation in
the protein, antibodies specific for this mutant protein can be
used to assay for the presence of the specific mutant protein.

The antibodies can also be used to assess normal and
aberrant subcellular localization of cells in the various
tissues in an organism. Experimental data as provided in
FIG. 1 indicates expression in humans in teratocarcinoma,
ovary, testis, nervous tissue, bladder, infant and fetal brain,
and thyroid gland. The diagnostic uses can be applied, not
only in genetic testing, but also in monitoring a treatment
modality. Accordingly, where treatment is ultimately aimed
at correcting expression level or the presence of aberrant
sequence and aberrant tissue distribution or developmental
expression, antibodies directed against the protein or
relevant fragments can be used to monitor therapeutic efficacy.

Additionally, antibodies are useful in pharmacogenomic
analysis. Thus, antibodies prepared against polymorphic
proteins can be used to identify individuals that require
modified treatment modalities. The antibodies are also
useful as diagnostic tools as an immunological marker for
aberrant protein analyzed by electrophoretic mobility,
isoelectric point, tryptic peptide digest, and other physical
assays known to those in the art.

The antibodies are also useful for tissue typing.
Experimental data as provided in FIG. 1 indicates expression in
humans in teratocarcinoma, ovary, testis, nervous tissue,
bladder, infant and fetal brain, and thyroid gland. Thus,
where a specific protein has been correlated with expression
in a specific tissue, antibodies that are specific for this
protein can be used to identify a tissue type.

The antibodies are also useful for inhibiting protein
function, for example, blocking the binding of the kinase
peptide to a binding partner such as a substrate. These uses
can also be applied in a therapeutic context in which
treatment involves inhibiting the protein's function. An
antibody can be used, for example, to block binding, thus
modulating (agonizing or antagonizing) the peptide's
activity. Antibodies can be prepared against specific fragments
containing sites required for function or against intact
protein that is associated with a cell or cell membrane. See FIG.
2 for structural information relating to the proteins of the
present invention.

The invention also encompasses kits for using antibodies
to detect the presence of a protein in a biological sample.
The kit can comprise antibodies such as a labeled or
labelable antibody and a compound or agent for detecting protein
in a biological sample; means for determining the amount of
protein in the sample; means for comparing the amount of
protein in the sample with a standard; and instructions for
use. Such a kit can be supplied to detect a single protein or
epitope or can be configured to detect one of a multitude of
epitopes, such as in an antibody detection array. Arrays are
described in detail below for nucleic acid arrays and similar
methods have been developed for antibody arrays.

Nucleic Acid Molecules

The present invention further provides isolated nucleic
acid molecules that encode a kinase peptide or protein of the
present invention (cDNA, transcript and genomic sequence).
Such nucleic acid molecules will consist of, consist
essentially of, or comprise a nucleotide sequence that encodes one
of the kinase peptides of the present invention, an allelic
variant thereof, or an ortholog or paralog thereof.

As used herein, an "isolated" nucleic acid molecule is one
that is separated from other nucleic acid present in the
natural source of the nucleic acid. Preferably, an "isolated"
nucleic acid is free of sequences which naturally flank the
nucleic acid (i.e., sequences located at the 5' and 3' ends of
the nucleic acid) in the genomic DNA of the organism from
which the nucleic acid is derived. However, there can be
some flanking nucleotide sequences, for example up to
about 5KB, 4KB, 3KB, 2KB, or 1KB or less, particularly
contiguous peptide encoding sequences and peptide
encoding sequences within the same gene but separated by introns
in the genomic sequence. The important point is that the
nucleic acid is isolated from remote and unimportant
flanking sequences such that it can be subjected to the specific
manipulations described herein such as recombinant
expression, preparation of probes and primers, and other
uses specific to the nucleic acid sequences.

Moreover, an "isolated" nucleic acid molecule, such as a 
transcript/cDNA molecule, can be substantially free of other 
cellular material, or culture medium when produced by
recombinant techniques, or chemical precursors or other
chemicals when chemically synthesized. However, the
nucleic acid molecule can be fused to other coding or
regulatory sequences and still be considered isolated.

For example, recombinant DNA molecules contained in a
vector are considered isolated. Further examples of isolated
DNA molecules include recombinant DNA molecules
maintained in heterologous host cells or purified (partially or
substantially) DNA molecules in solution. Isolated RNA
molecules include in vivo or in vitro RNA transcripts of the
isolated DNA molecules of the present invention. Isolated
nucleic acid molecules according to the present invention
further include such molecules produced synthetically.

Accordingly, the present invention provides nucleic acid
molecules that consist of the nucleotide sequence shown in
FIG. 1 or 3 (SEQ ID NO:1, transcript sequence and SEQ ID
NO:3, genomic sequence), or any nucleic acid molecule that
encodes the protein provided in FIG. 2, SEQ ID NO:2. A
nucleic acid molecule consists of a nucleotide sequence
when the nucleotide sequence is the complete nucleotide
sequence of the nucleic acid molecule.

The present invention further provides nucleic acid
molecules that consist essentially of the nucleotide sequence
shown in FIG. 1 or 3 (SEQ ID NO:1, transcript sequence and
SEQ ID NO:3, genomic sequence), or any nucleic acid
molecule that encodes the protein provided in FIG. 2, SEQ
ID NO:2. A nucleic acid molecule consists essentially of a
nucleotide sequence when such a nucleotide sequence is
present with only a few additional nucleic acid residues in
the final nucleic acid molecule.

The present invention further provides nucleic acid
molecules that comprise the nucleotide sequences shown in FIG.
1 or 3 (SEQ ID NO:1, transcript sequence and SEQ ID
NO:3, genomic sequence), or any nucleic acid molecule that
encodes the protein provided in FIG. 2, SEQ ID NO:2. A
nucleic acid molecule comprises a nucleotide sequence
when the nucleotide sequence is at least part of the final
nucleotide sequence of the nucleic acid molecule. In such a
fashion, the nucleic acid molecule can be only the nucleotide
sequence or have additional nucleic acid residues, such as
nucleic acid residues that are naturally associated with it or
heterologous nucleotide sequences. Such a nucleic acid
molecule can have a few additional nucleotides or can
comprise several hundred or more additional nucleotides. A
brief description of how various types of these nucleic acid
molecules can be readily made/isolated is provided below.

a protein from precursor to a mature form, facilitate protein
trafficking, prolong or shorten protein half-life or facilitate
manipulation of a protein for assay or production, among
other things. As generally is the case in situ, the additional
amino acids may be processed away from the mature protein
by cellular enzymes.

As mentioned above, the isolated nucleic acid molecules
include, but are not limited to, the sequence encoding the
kinase peptide alone, the sequence encoding the mature
peptide and additional coding sequences, such as a leader or
secretory sequence (e.g., a pre-pro or pro-protein sequence),
the sequence encoding the mature peptide, with or without
the additional coding sequences, plus additional non-coding
sequences, for example introns and non-coding 5' and 3'
sequences such as transcribed but non-translated sequences
that play a role in transcription, mRNA processing
(including splicing and polyadenylation signals), ribosome
binding and stability of mRNA. In addition, the nucleic acid
molecule may be fused to a marker sequence encoding, for
example, a peptide that facilitates purification.

Isolated nucleic acid molecules can be in the form of
RNA, such as mRNA, or in the form of DNA, including cDNA
and genomic DNA obtained by cloning or produced by
chemical synthetic techniques or by a combination thereof.
The nucleic acid, especially DNA, can be double-stranded or
single-stranded. Single-stranded nucleic acid can be the
coding strand (sense strand) or the non-coding strand
(anti-sense strand).

The invention further provides nucleic acid molecules that
encode fragments of the peptides of the present invention as
well as nucleic acid molecules that encode obvious variants
of the kinase proteins of the present invention that are
described above. Such nucleic acid molecules may be
naturally occurring, such as allelic variants (same locus),
paralogs (different locus), and orthologs (different organism), or
may be constructed by recombinant DNA methods or by
chemical synthesis. Such non-naturally occurring variants
may be made by mutagenesis techniques, including those
applied to nucleic acid molecules, cells, or organisms.
Accordingly, as discussed above, the variants can contain
nucleotide substitutions, deletions, inversions and
insertions. Variation can occur in either or both the coding and
non-coding regions. The variations can produce both
conservative and non-conservative amino acid substitutions.

The present invention further provides non-coding
fragments of the nucleic acid molecules provided in FIGS. 1 and
3. Preferred non-coding fragments include, but are not
limited to, promoter sequences, enhancer sequences, gene
modulating sequences and gene termination sequences.
Such fragments are useful in controlling heterologous gene

In FIGS. 1 and 3, both coding and non-coding sequences
are provided. Because of the source of the present invention,
human genomic sequence (FIG. 3) and cDNA/transcript
sequences (FIG. 1), the nucleic acid molecules in the Figures
will contain genomic intronic sequences, 5' and 3' non-coding
sequences, gene regulatory regions and non-coding
intergenic sequences. In general such sequence features are
either noted in FIGS. 1 and 3 or can readily be identified 55 
using computational tools known in the art. As discussed 
below, some of the non-coding regions, particularly gene 
regulatory elements such as promoters, are useful for a 
variety of purposes, e.g. control of heterologous gene 

expression, target for identifying gene activity modulating 6 o bearing regions of the" peptide, or can be useful as DNA 
compounds, and are particularly claimed as fragments of the probes and primers. Such fragments can be isolated using 
genomic sequence provided herein. the known nucleotide sequence to synthesize an oligonucle- 

The isolated nucleic acid molecules can encode the otide probe. A labeled probe can then be used to screen a 
mature protein plus additional amino or carboxyl-terminal cDNA library, genomic DNA library, or mRNA to isolate 
amino acids, or amino acids interior to the mature peptide 65 nucleic acid corresponding to the coding region. Further, 
(when the mature form has more than one peptide chain, for primers can be used in PCR reactions to clone specific 
instance). Such sequences may play a role in processing of regions of gene. 



expression and in developing screens to identify gene- 
modulating agents. A promoter can readily be identified as 
being 5' to the ATG start site in the genomic sequence 
provided in FIG. 3. 

A fragment comprises a contiguous nucleotide sequence 
greater than 12 or more nucleotides. Further, a fragment 
could at least 30, 40, 50, 100, 250 or 500 nucleotides in 
length. The length of the fragment will be based on its 
intended use. For example, the fragment can encode epitope 
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A probe/primer typically comprises substantially a puri- 
fied oligonucleotide or oligonucleotide pair. The oligonucle- 
otide typically comprises a region of nucleotide sequence 
that hybridizes under stringent conditions to at least about 
12, 20, 25, 40, 50 or more consecutive nucleotides. 5 

Orthologs, homologs, and allelic variants can be identified
using methods well known in the art. As described in the
Peptide Section, these variants comprise a nucleotide
sequence encoding a peptide that is typically 60-70%,
70-80%, 80-90%, and more typically at least about 90-95%
or more homologous to the nucleotide sequence shown in
the Figure sheets or a fragment of this sequence. Such
nucleic acid molecules can readily be identified as being
able to hybridize under moderate to stringent conditions to
the nucleotide sequence shown in the Figure sheets or a
fragment of the sequence. Allelic variants can readily be
determined by genetic locus of the encoding gene. The gene
encoding the novel kinase protein of the present invention is
located on a genome component that has been mapped to
human chromosome 22 (as indicated in FIG. 3), which is
supported by multiple lines of evidence, such as STS and
BAC map data.
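
The percentage ranges recited above can be made concrete with a
short calculation. The following is a minimal Python sketch,
assuming that "homologous" is read as simple percent identity
over two sequences that have already been aligned to equal
length. The specification does not define the calculation, and
the function name percent_identity is illustrative only.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned positions carrying the same base.

    Assumption: both sequences are already aligned and of equal
    length; gaps, if any, are ordinary characters that simply
    count as mismatches unless both sequences share them.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # One mismatch in ten positions.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

A value computed this way could then be compared against the
60-70%, 70-80%, 80-90% or 90-95% ranges named in the text.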

FIG. 3 provides information on SNPs that have been 
found in the gene encoding the kinase protein of the present 
invention. SNPs were identified at 42 different nucleotide
positions. Some of these SNPs, which are located outside the 
ORF and in introns, may affect gene transcription. 

As used herein, the term "hybridizes under stringent
conditions" is intended to describe conditions for
hybridization and washing under which nucleotide sequences
encoding a peptide at least 60-70% homologous to each
other typically remain hybridized to each other. The
conditions can be such that sequences at least about 60%, at least
about 70%, or at least about 80% or more homologous to
each other typically remain hybridized to each other. Such
stringent conditions are known to those skilled in the art and
can be found in Current Protocols in Molecular Biology,
John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. One example
of stringent hybridization conditions is hybridization in 6x
sodium chloride/sodium citrate (SSC) at about 45° C.,
followed by one or more washes in 0.2x SSC, 0.1% SDS at
50-65° C. Examples of moderate to low stringency
hybridization conditions are well known in the art.

Nucleic Acid Molecule Uses

The nucleic acid molecules of the present invention are 
useful for probes, primers, chemical intermediates, and in 
biological assays. The nucleic acid molecules are useful as 
a hybridization probe for messenger RNA, transcript/cDNA
and genomic DNA to isolate full-length cDNA and genomic
clones encoding the peptide described in FIG. 2 and to
isolate cDNA and genomic clones that correspond to variants
(alleles, orthologs, etc.) producing the same or related
peptides shown in FIG. 2. As illustrated in FIG. 3, SNPs
were identified at 42 different nucleotide positions. 

The probe can correspond to any sequence along the 
entire length of the nucleic acid molecules provided in the 
Figures. Accordingly, it could be derived from 5' noncoding 
regions, the coding region, and 3' noncoding regions.
However, as discussed, fragments are not to be construed as 
encompassing fragments disclosed prior to the present 
invention. 

The nucleic acid molecules are also useful as primers for 
PCR to amplify any given region of a nucleic acid molecule
and are useful to synthesize antisense molecules of desired 
length and sequence.

The nucleic acid molecules are also useful for constructing
recombinant vectors. Such vectors include expression
vectors that express a portion of, or all of, the peptide 
sequences. Vectors also include insertion vectors, used to 
integrate into another nucleic acid molecule sequence, such 
as into the cellular genome, to alter in situ expression of a 
gene and/or gene product. For example, an endogenous 
coding sequence can be replaced via homologous recombi- 
nation with all or part of the coding region containing one or 
more specifically introduced mutations. 

The nucleic acid molecules are also useful for expressing 
antigenic portions of the proteins. 

The nucleic acid molecules are also useful as probes for 
determining the chromosomal positions of the nucleic acid 
molecules by means of in situ hybridization methods. The 
gene encoding the novel kinase protein of the present 
invention is located on a genome component that has been 
mapped to human chromosome 22 (as indicated in FIG. 3), 
which is supported by multiple lines of evidence, such as 
STS and BAC map data. 

The nucleic acid molecules are also useful in making 
vectors containing the gene regulatory regions of the nucleic 
acid molecules of the present invention. 

The nucleic acid molecules are also useful for designing 
ribozymes corresponding to all, or a part, of the mRNA 
produced from the nucleic acid molecules described herein. 

The nucleic acid molecules are also useful for making 
vectors that express part, or all, of the peptides. 

The nucleic acid molecules are also useful for construct- 
ing host cells expressing a part, or all, of the nucleic acid 
molecules and peptides. 

The nucleic acid molecules are also useful for construct- 
ing transgenic animals expressing all, or a part, of the 
nucleic acid molecules and peptides. 

The nucleic acid molecules are also useful as hybridiza- 
tion probes for determining the presence, level, form and 
distribution of nucleic acid expression. Experimental data as 
provided in FIG. 1 indicates that the kinase proteins of the 
present invention are expressed in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
brain, and thyroid gland, as indicated by virtual northern blot 
analysis. In addition, PCR-based tissue screening panels 
indicate expression in fetal brain. Accordingly, the probes 
can be used to detect the presence of, or to determine levels 
of, a specific nucleic acid molecule in cells, tissues, and in 
organisms. The nucleic acid whose level is determined can 
be DNA or RNA. Accordingly, probes corresponding to the
peptides described herein can be used to assess expression 
and/or gene copy number in a given cell, tissue, or organism. 
These uses are relevant for diagnosis of disorders involving 
an increase or decrease in kinase protein expression relative 
to normal results. 

In vitro techniques for detection of mRNA include Northern
hybridizations and in situ hybridizations. In vitro
techniques for detecting DNA include Southern hybridizations
and in situ hybridization.

Probes can be used as a part of a diagnostic test kit for 
identifying cells or tissues that express a kinase protein, such 
as by measuring a level of a kinase-encoding nucleic acid in 
a sample of cells from a subject, e.g., mRNA or genomic
DNA, or determining if a kinase gene has been mutated. 
Experimental data as provided in FIG. 1 indicates that the 
kinase proteins of the present invention are expressed in 
humans in teratocarcinoma, ovary, testis, nervous tissue, 
bladder, infant brain, and thyroid gland, as indicated by
virtual northern blot analysis. In addition, PCR-based tissue
screening panels indicate expression in fetal brain. 

Nucleic acid expression assays are useful for drug screen- 
ing to identify compounds that modulate kinase nucleic acid 
expression.

The invention thus provides a method for identifying a 
compound that can be used to treat a disorder associated 
with nucleic acid expression of the kinase gene, particularly 
biological and pathological processes that are mediated by 
the kinase in cells and tissues that express it. Experimental
data as provided in FIG. 1 indicates expression in humans in 
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant 
and fetal brain, and thyroid gland. The method typically 
includes assaying the ability of the compound to modulate 
the expression of the kinase nucleic acid and thus identifying 
a compound that can be used to treat a disorder characterized 
by undesired kinase nucleic acid expression. The assays can 
be performed in cell-based and cell-free systems. Cell-based 
assays include cells naturally expressing the kinase nucleic 
acid or recombinant cells genetically engineered to express 
specific nucleic acid sequences. 

The assay for kinase nucleic acid expression can involve 
direct assay of nucleic acid levels, such as mRNA levels, or 
assay of collateral compounds involved in the signal pathway.
Further, the expression of genes that are up- or down- 
regulated in response to the kinase protein signal pathway 
can also be assayed. In this embodiment the regulatory 
regions of these genes can be operably linked to a reporter 
gene such as luciferase. 

Thus, modulators of kinase gene expression can be
identified in a method wherein a cell is contacted with a
candidate compound and the expression of mRNA
determined. The level of expression of kinase mRNA in the
presence of the candidate compound is compared to the level
of expression of kinase mRNA in the absence of the
candidate compound. The candidate compound can then be
identified as a modulator of nucleic acid expression based on this
comparison and be used, for example, to treat a disorder
characterized by aberrant nucleic acid expression. When
expression of mRNA is statistically significantly greater in
the presence of the candidate compound than in its absence,
the candidate compound is identified as a stimulator of
nucleic acid expression. When nucleic acid expression is
statistically significantly less in the presence of the candidate
compound than in its absence, the candidate compound is
identified as an inhibitor of nucleic acid expression.
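
The comparison described above can be sketched in a few lines
of Python. This is an illustration only, under stated
assumptions: expression is given as replicate mRNA measurements
with and without the candidate compound, "statistically
significantly" is read as a two-sided exact permutation test on
the difference in means at a hypothetical threshold alpha, and
the function names are not taken from the specification, which
does not name a statistical test.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(treated, control):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(treated) + list(control)
    indices = range(len(pooled))
    observed = abs(mean(treated) - mean(control))
    total = 0
    extreme = 0
    # Enumerate every way of relabeling the pooled measurements.
    for chosen in combinations(indices, len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def classify_compound(treated, control, alpha=0.05):
    """Label a candidate compound from replicate expression levels."""
    if permutation_p_value(treated, control) >= alpha:
        return "no significant effect"
    return "stimulator" if mean(treated) > mean(control) else "inhibitor"


if __name__ == "__main__":
    with_compound = [9.8, 10.1, 10.4, 9.9]   # hypothetical mRNA levels
    without_compound = [5.0, 5.2, 4.9, 5.1]  # hypothetical mRNA levels
    print(classify_compound(with_compound, without_compound))
```

Exhaustive enumeration is practical only for small replicate
counts; a larger study would use a sampled permutation test or
another test chosen by the investigator.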

The invention further provides methods of treatment, with
the nucleic acid as a target, using a compound identified
through drug screening as a gene modulator to modulate
kinase nucleic acid expression in cells and tissues that
express the kinase. Experimental data as provided in FIG. 1
indicates that the kinase proteins of the present invention are
expressed in humans in teratocarcinoma, ovary, testis,
nervous tissue, bladder, infant brain, and thyroid gland, as
indicated by virtual northern blot analysis. In addition,
PCR-based tissue screening panels indicate expression in
fetal brain. Modulation includes both up-regulation (i.e.
activation or agonization) and down-regulation (suppression
or antagonization) of nucleic acid expression.

Alternatively, a modulator for kinase nucleic acid
expression can be a small molecule or drug identified using the
screening assays described herein as long as the drug or
small molecule inhibits the kinase nucleic acid expression in
the cells and tissues that express the protein. Experimental
data as provided in FIG. 1 indicates expression in humans in
teratocarcinoma, ovary, testis, nervous tissue, bladder, infant
and fetal brain, and thyroid gland.



The nucleic acid molecules are also useful for monitoring 
the effectiveness of modulating compounds on the expres- 
sion or activity of the kinase gene in clinical trials or in a 
treatment regimen. Thus, the gene expression pattern can 
serve as a barometer for the continuing effectiveness of 
treatment with the compound, particularly with compounds 
to which a patient can develop resistance. The gene expres- 
sion pattern can also serve as a marker indicative of a 
physiological response of the affected cells to the compound. 
Accordingly, such monitoring would allow either increased 
administration of the compound or the administration of 
alternative compounds to which the patient has not become 
resistant. Similarly, if the level of nucleic acid expression 
falls below a desirable level, administration of the com- 
pound could be commensurately decreased. 

The nucleic acid molecules are also useful in diagnostic 
assays for qualitative changes in kinase nucleic acid 
expression, and particularly in qualitative changes that lead 
to pathology. The nucleic acid molecules can be used to 
detect mutations in kinase genes and gene expression prod- 
ucts such as mRNA. The nucleic acid molecules can be used 
as hybridization probes to detect naturally occurring genetic 
mutations in the kinase gene and thereby to determine 
whether a subject with the mutation is at risk for a disorder 
caused by the mutation. Mutations include deletion, 
addition, or substitution of one or more nucleotides in the 
gene, chromosomal rearrangement, such as inversion or 
transposition, modification of genomic DNA, such as aber- 
rant methylation patterns or changes in gene copy number, 
such as amplification. Detection of a mutated form of the 
kinase gene associated with a dysfunction provides a diag- 
nostic tool for an active disease or susceptibility to disease 
when the disease results from overexpression, 
underexpression, or altered expression of a kinase protein. 

Individuals carrying mutations in the kinase gene can be 
detected at the nucleic acid level by a variety of techniques. 
FIG. 3 provides information on SNPs that have been found 
in the gene encoding the kinase protein of the present 
invention. SNPs were identified at 42 different nucleotide 
positions. Some of these SNPs, which are located outside the 
ORF and in introns, may affect gene transcription. The gene 
encoding the novel kinase protein of the present invention is 
located on a genome component that has been mapped to 
human chromosome 22 (as indicated in FIG. 3), which is 
supported by multiple lines of evidence, such as STS and 
BAC map data. Genomic DNA can be analyzed directly or 
can be amplified by using PCR prior to analysis. RNA or 
cDNA can be used in the same way. In some uses, detection
of the mutation involves the use of a probe/primer in a 
polymerase chain reaction (PCR) (see, e.g. U.S. Pat. Nos. 
4,683,195 and 4,683,202), such as anchor PCR or RACE 
PCR, or, alternatively, in a ligation chain reaction (LCR) 
(see, e.g., Landegran et al., Science 241:1077-1080 (1988); 
and Nakazawa et al., PNAS 91:360-364 (1994)), the latter of
which can be particularly useful for detecting point muta- 
tions in the gene (see Abravaya et al., Nucleic Acids Res. 
23:675-682 (1995)). This method can include the steps of 
collecting a sample of cells from a patient, isolating nucleic 
acid (e.g., genomic, mRNA or both) from the cells of the 
sample, contacting the nucleic acid sample with one or more 
primers which specifically hybridize to a gene under con- 
ditions such that hybridization and amplification of the gene 
(if present) occurs, and detecting the presence or absence of 
an amplification product, or detecting the size of the ampli- 
fication product and comparing the length to a control 
sample. Deletions and insertions can be detected by a change 
in size of the amplified product compared to the normal
genotype. Point mutations can be identified by hybridizing
amplified DNA to normal RNA or antisense DNA 
sequences. 
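
The size comparison described above can be illustrated with a
small in silico sketch in Python. It assumes exact primer
matches on a single strand of a linear template, with the
reverse primer written 5' to 3'; the template and primer
sequences in the example are invented, and the function names
are illustrative rather than part of the specification.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def amplicon_length(template, forward_primer, reverse_primer):
    """Length of the product bounded by the two primers, or None.

    The forward primer must match the template strand exactly; the
    reverse primer must match the opposite strand, so its reverse
    complement is searched for downstream of the forward primer.
    """
    template = template.upper()
    fwd = forward_primer.upper()
    rev_site = reverse_complement(reverse_primer.upper())
    start = template.find(fwd)
    if start == -1:
        return None
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    return end + len(rev_site) - start


def classify_size_change(sample_length, control_length):
    """Compare an amplification product with the control product."""
    if sample_length == control_length:
        return "no size change"
    return "insertion" if sample_length > control_length else "deletion"


if __name__ == "__main__":
    control = "GGACCCGGATATATATTTGGCACC"  # hypothetical normal genotype
    sample = "GGACCCGGATATTTGGCACC"       # hypothetical 4-base deletion
    n = amplicon_length(control, "ACCCGG", "TGCCAA")
    m = amplicon_length(sample, "ACCCGG", "TGCCAA")
    print(n, m, classify_size_change(m, n))
```

A point mutation would leave the product length unchanged,
which is why the text turns to hybridization for that case.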

Alternatively, mutations in a kinase gene can be directly 
identified, for example, by alterations in restriction enzyme 
digestion patterns determined by gel electrophoresis. 
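
How a mutation alters a digestion pattern can be shown with a
brief Python sketch. It assumes a linear molecule and uses the
EcoRI recognition site GAATTC, cut after the first G, purely as
a familiar example; the specification names no particular
enzyme, and the sequences and the function name are
hypothetical.

```python
def digest_fragment_lengths(seq, site="GAATTC", cut_offset=1):
    """Sorted fragment lengths after cutting a linear sequence.

    cut_offset is the number of bases of the recognition site that
    stay on the upstream fragment (1 for EcoRI, G^AATTC).
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


if __name__ == "__main__":
    wild_type = "AAAAGAATTCCCCCCCCC"  # hypothetical; contains one site
    mutant = "AAAAGAGTTCCCCCCCCC"     # hypothetical; site destroyed
    print(digest_fragment_lengths(wild_type))
    print(digest_fragment_lengths(mutant))
```

The differing lists of fragment lengths correspond to the
differing band patterns that would be seen on a gel.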

Further, sequence-specific ribozymes (U.S. Pat. No. 
5,498,531) can be used to score for the presence of specific 
mutations by development or loss of a ribozyme cleavage 
site. Perfectly matched sequences can be distinguished from 
mismatched sequences by nuclease cleavage digestion 
assays or by differences in melting temperature. 

Sequence changes at specific locations can also be 
assessed by nuclease protection assays such as RNase and 
S1 protection or the chemical cleavage method.
Furthermore, sequence differences between a mutant kinase 
gene and a wild-type gene can be determined by direct DNA 
sequencing. A variety of automated sequencing procedures 
can be utilized when performing the diagnostic assays 
(Naeve, C. W., (1995) Biotechniques 19:448), including 
sequencing by mass spectrometry (see, e.g., PCT Interna- 
tional Publication No. WO 94/16101; Cohen et al., Adv. 
Chromatogr. 36:127-162 (1996); and Griffin et al., Appl.
Biochem. Biotechnol. 38:147-159 (1993)).

Other methods for detecting mutations in the gene include 
methods in which protection from cleavage agents is used to 
detect mismatched bases in RNA/RNA or RNA/DNA 
duplexes (Myers et al., Science 230:1242 (1985); Cotton et
al., PNAS 85:4397 (1988); Saleeba et al., Meth. Enzymol.
217:286-295 (1992)), electrophoretic mobility of mutant and
wild type nucleic acid is compared (Orita et al., PNAS
86:2766 (1989); Cotton et al., Mutat. Res. 285:125-144
(1993); and Hayashi et al., Genet. Anal. Tech. Appl. 9:73-79
(1992)), and movement of mutant or wild-type fragments in
polyacrylamide gels containing a gradient of denaturant is
assayed using denaturing gradient gel electrophoresis 
(Myers et al., Nature 313:495 (1985)). Examples of other 
techniques for detecting point mutations include selective 
oligonucleotide hybridization, selective amplification, and 
selective primer extension. 

The nucleic acid molecules are also useful for testing an 
individual for a genotype that while not necessarily causing 
the disease, nevertheless affects the treatment modality. 
Thus, the nucleic acid molecules can be used to study the 
relationship between an individual's genotype and the indi- 
vidual's response to a compound used for treatment 
(pharmacogenomic relationship). Accordingly, the nucleic 
acid molecules described herein can be used to assess the 
mutation content of the kinase gene in an individual in order
to select an appropriate compound or dosage regimen for 
treatment. FIG. 3 provides information on SNPs that have 
been found in the gene encoding the kinase protein of the 
present invention. SNPs were identified at 42 different 
nucleotide positions. Some of these SNPs, which are located 
outside the ORF and in introns, may affect gene transcrip- 
tion. 

Thus nucleic acid molecules displaying genetic variations 
that affect treatment provide a diagnostic target that can be 
used to tailor treatment in an individual. Accordingly, the 
production of recombinant cells and animals containing 
these polymorphisms allows effective clinical design of
treatment compounds and dosage regimens.

The nucleic acid molecules are thus useful as antisense 
constructs to control kinase gene expression in cells, tissues, 
and organisms. A DNA antisense nucleic acid molecule is 
designed to be complementary to a region of the gene
involved in transcription, preventing transcription and hence
production of kinase protein. An antisense RNA or DNA 
nucleic acid molecule would hybridize to the mRNA and 
thus block translation of mRNA into kinase protein. 
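
The complementarity on which an antisense construct relies can
be illustrated with a minimal Python sketch. It assumes plain
Watson-Crick pairing over the bases A, C, G and T (or U for
RNA), and the target sequence shown is invented; the function
name is illustrative and no particular antisense design is
implied by the specification.

```python
_DNA_COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def antisense_dna(target):
    """DNA oligonucleotide complementary to a DNA or RNA target.

    The result is written 5' to 3', so it is the reverse
    complement of the target as written 5' to 3'.
    """
    target = target.upper()
    unknown = set(target) - set("ACGTU")
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    return target.translate(_DNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    mrna_region = "AUGGCCAAG"  # hypothetical region of an mRNA
    print(antisense_dna(mrna_region))
```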

Alternatively, a class of antisense molecules can be used 
to inactivate mRNA in order to decrease expression of 
kinase nucleic acid. Accordingly, these molecules can treat 
a disorder characterized by abnormal or undesired kinase 
nucleic acid expression. This technique involves cleavage 
by means of ribozymes containing nucleotide sequences 
complementary to one or more regions in the mRNA that 
attenuate the ability of the mRNA to be translated. Possible 
regions include coding regions and particularly coding 
regions corresponding to the catalytic and other functional 
activities of the kinase protein, such as substrate binding. 

The nucleic acid molecules also provide vectors for gene 
therapy in patients containing cells that are aberrant in 
kinase gene expression. Thus, recombinant cells, which 
include the patient's cells that have been engineered ex vivo 
and returned to the patient, are introduced into an individual 
where the cells produce the desired kinase protein to treat the 
individual. 

The invention also encompasses kits for detecting the 
presence of a kinase nucleic acid in a biological sample. 
Experimental data as provided in FIG. 1 indicates that the 
kinase proteins of the present invention are expressed in 
humans in teratocarcinoma, ovary, testis, nervous tissue, 
bladder, infant brain, and thyroid gland, as indicated by 
virtual northern blot analysis. In addition, PCR-based tissue 
screening panels indicate expression in fetal brain. For 
example, the kit can comprise reagents such as a labeled or 
labelable nucleic acid or agent capable of detecting kinase 
nucleic acid in a biological sample; means for determining 
the amount of kinase nucleic acid in the sample; and means 
for comparing the amount of kinase nucleic acid in the 
sample with a standard. The compound or agent can be 
packaged in a suitable container. The kit can further com- 
prise instructions for using the kit to detect kinase protein 
mRNA or DNA.

Nucleic Acid Arrays 

The present invention further provides nucleic acid detec- 
tion kits, such as arrays or microarrays of nucleic acid 
molecules that are based on the sequence information provided
in FIGS. 1 and 3 (SEQ ID NOS:1 and 3).

As used herein "Arrays" or "Microarrays" refers to an 
array of distinct polynucleotides or oligonucleotides synthe- 
sized on a substrate, such as paper, nylon or other type of 
membrane, filter, chip, glass slide, or any other suitable solid 
support. In one embodiment, the microarray is prepared and 
used according to the methods described in U.S. Pat. No. 
5,837,832, Chee et al., PCT application WO95/11995 (Chee
et al.), Lockhart, D. J. et al. (1996; Nat. Biotech. 14:
1675-1680) and Schena, M. et al. (1996; Proc. Natl. Acad.
Sci. 93: 10614-10619), all of which are incorporated herein 
in their entirety by reference. In other embodiments, such 
arrays are produced by the methods described by Brown et 
al., U.S. Pat. No. 5,807,522. 

The microarray or detection kit is preferably composed of 
a large number of unique, single-stranded nucleic acid 
sequences, usually either synthetic antisense oligonucle- 
otides or fragments of cDNAs, fixed to a solid support. The 
oligonucleotides are preferably about 6-60 nucleotides in 
length, more preferably 15-30 nucleotides in length, and 
most preferably about 20-25 nucleotides in length. For a 
certain type of microarray or detection kit, it may be
preferable to use oligonucleotides that are only 7-20
nucleotides in length. The microarray or detection kit may contain
oligonucleotides that cover the known 5' or 3' sequence;
sequential oligonucleotides which cover the full length
sequence; or unique oligonucleotides selected from particular
areas along the length of the sequence. Polynucleotides
used in the microarray or detection kit may be
oligonucleotides that are specific to a gene or genes of interest.

In order to produce oligonucleotides to a known sequence 
for a microarray or detection kit, the gene(s) of interest (or
an ORF identified from the contigs of the present invention) 
is typically examined using a computer algorithm which 
starts at the 5' or at the 3' end of the nucleotide sequence. 
Typical algorithms will then identify oligomers of defined 
length that are unique to the gene, have a GC content within 
a range suitable for hybridization, and lack predicted sec- 
ondary structure that may interfere with hybridization. In 
certain situations it may be appropriate to use pairs of 
oligonucleotides on a microarray or detection kit. The 
"pairs" will be identical, except for one nucleotide that 
preferably is located in the center of the sequence. The 
second oligonucleotide in the pair (mismatched by one) 
serves as a control. The number of oligonucleotide pairs may 
range from two to one million. The oligomers are synthe- 
sized at designated areas on a substrate using a light-directed 
chemical process. The substrate may be paper, nylon or
other type of membrane, filter, chip, glass slide or any other 
suitable solid support. 
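
The oligomer selection and mismatch-control steps described
above can be sketched in Python as follows. This is a
simplified illustration under stated assumptions: uniqueness is
checked only against the gene itself and an optional list of
background sequences supplied by the caller, the GC range and
oligomer length are hypothetical defaults, secondary-structure
prediction is omitted, and sequences contain only A, C, G and
T. None of the names come from the specification.

```python
def gc_fraction(oligo):
    """Fraction of bases in the oligomer that are G or C."""
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def candidate_oligos(gene, length=20, gc_min=0.4, gc_max=0.6, background=()):
    """Scan from the 5' end and keep oligomers that pass the filters.

    Returns (start_index, oligomer) pairs for windows whose GC
    content is in range, that occur only once in the gene, and that
    do not occur in any background sequence.
    """
    gene = gene.upper()
    picks = []
    for start in range(len(gene) - length + 1):
        oligo = gene[start:start + length]
        if not gc_min <= gc_fraction(oligo) <= gc_max:
            continue
        if gene.count(oligo) != 1:
            continue  # occurs more than once within the gene
        if any(oligo in other.upper() for other in background):
            continue  # also found in another sequence
        picks.append((start, oligo))
    return picks


def mismatch_control(oligo):
    """Copy of the oligomer with the central base substituted."""
    swap = {"A": "T", "T": "A", "G": "C", "C": "G"}
    mid = len(oligo) // 2
    return oligo[:mid] + swap[oligo[mid]] + oligo[mid + 1:]


if __name__ == "__main__":
    toy_gene = "ATGCATGCAA"  # hypothetical sequence, far shorter than a gene
    for start, oligo in candidate_oligos(toy_gene, length=4, gc_min=0.5, gc_max=0.5):
        print(start, oligo, mismatch_control(oligo))
```

Each selected oligomer and its single-base mismatch partner
form one of the "pairs" described in the text, with the
mismatched member serving as the control.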

In another aspect, an oligonucleotide may be synthesized 
on the surface of the substrate by using a chemical coupling 
procedure and an ink jet application apparatus, as described
in PCT application WO95/251116 (Baldeschweiler et al.)
which is incorporated herein in its entirety by reference. In 
another aspect, a "gridded" array analogous to a dot (or slot) 
blot may be used to arrange and link cDNA fragments or 
oligonucleotides to the surface of a substrate using a vacuum 
system, thermal, UV, mechanical or chemical bonding pro- 
cedures. An array, such as those described above, may be 
produced by hand or by using available devices (slot blot or 
dot blot apparatus), materials (any suitable solid support), 
and machines (including robotic instruments), and may 
contain 8, 24, 96, 384, 1536, 6144 or more oligonucleotides, 
or any other number between two and one million which 
lends itself to the efficient use of commercially available 
instrumentation. 

In order to conduct sample analysis using a microarray or 
detection kit, the RNA or DNA from a biological sample is 
made into hybridization probes. The mRNA is isolated, and 
cDNA is produced and used as a template to make antisense 
RNA (aRNA). The aRNA is amplified in the presence of 
fluorescent nucleotides, and labeled probes are incubated 
with the microarray or detection kit so that the probe
sequences hybridize to complementary oligonucleotides of 
the microarray or detection kit. Incubation conditions are 
adjusted so that hybridization occurs with precise comple- 
mentary matches or with various degrees of less comple- 
mentarity. After removal of nonhybridized probes, a scanner 
is used to determine the levels and patterns of fluorescence. 
The scanned images are examined to determine degree of 
complementarity and the relative abundance of each oligo- 
nucleotide sequence on the microarray or detection kit. The 
biological samples may be obtained from any bodily fluids 
(such as blood, urine, saliva, phlegm, gastric juices, etc.), 
cultured cells, biopsies, or other tissue preparations. A 
detection system may be used to measure the absence, 
presence, and amount of hybridization for all of the distinct 
sequences simultaneously. This data may be used for large-scale
correlation studies on the sequences, expression
patterns, mutations, variants, or polymorphisms among 
samples.

Using such arrays, the present invention provides methods
to identify the expression of the kinase proteins/peptides
of the present invention. In detail, such methods comprise
incubating a test sample with one or more nucleic acid
molecules and assaying for binding of the nucleic acid
molecule with components within the test sample. Such
assays will typically involve arrays comprising many genes,
at least one of which is a gene of the present invention
and/or alleles of the kinase gene of the present invention.
FIG. 3 provides information on SNPs that have been found in the
gene encoding the kinase protein of the present invention. 
SNPs were identified at 42 different nucleotide positions. 
Some of these SNPs, which are located outside the ORF and 
in introns, may affect gene transcription. 

Conditions for incubating a nucleic acid molecule with a 
test sample vary. Incubation conditions depend on the format 
employed in the assay, the detection methods employed, and 
the type and nature of the nucleic acid molecule used in the 
assay. One skilled in the art will recognize that any one of 
the commonly available hybridization, amplification or 
array assay formats can readily be adapted to employ the 
novel fragments of the Human genome disclosed herein. 
Examples of such assays can be found in Chard, T., An
Introduction to Radioimmunoassay and Related Techniques, 
Elsevier Science Publishers, Amsterdam, The Netherlands 
(1986); Bullock, G. R. et al., Techniques in 
Immunocytochemistry, Academic Press, Orlando, Fla. Vol. 1 
(1 982), Vol. 2 (1983), Vol. 3 (1985); Tljssen, P., Practice 
and Theory of Enzyme Immunoassays: Laboratory Tech- 
niques in Biochemistry and Molecular Biology, Elsevier 
Science Publishers, Amsterdam, The Netherlands (1985). 

The test samples of the present invention include cells, 
protein or membrane extracts of cells. The test sample used 
in the above-described method will vary based on the assay 
format, nature of the detection method and the tissues, cells 
or extracts used as the sample to be assayed. Methods for
preparing nucleic acid extracts of cells are well known in
the art and can be readily adapted in order to obtain a
sample that is compatible with the system utilized. 

In another embodiment of the present invention, kits are 
provided which contain the necessary reagents to carry out 
the assays of the present invention. 

Specifically, the invention provides a compartmentalized 
kit to receive, in close confinement, one or more containers 
which comprises: (a) a first container comprising one of the 
nucleic acid molecules that can bind to a fragment of the 
Human genome disclosed herein; and (b) one or more other 
containers comprising one or more of the following: wash
reagents or reagents capable of detecting the presence of a
bound nucleic acid.

In detail, a compartmentalized kit includes any kit in 
which reagents are contained in separate containers. Such 
containers include small glass containers, plastic containers, 
strips of plastic, glass or paper, or arraying material such as 
silica. Such containers allow one to efficiently transfer
reagents from one compartment to another compartment 
such that the samples and reagents are not cross- 
contaminated, and the agents or solutions of each container 
can be added in a quantitative fashion from one compart- 
ment to another. Such containers will include a container 
which will accept the test sample, a container which contains 
the nucleic acid probe, containers which contain wash 
reagents (such as phosphate buffered saline, Tris-buffers, 
etc.), and containers which contain the reagents used to 
detect the bound probe. One skilled in the art will readily
recognize that the previously unidentified kinase gene of the
present invention can be routinely identified using the
sequence information disclosed herein and can be readily
incorporated into one of the established kit formats that are
well known in the art, particularly expression arrays.

Vectors/Host Cells

The invention also provides vectors containing the nucleic 
acid molecules described herein. The term "vector" refers to 
a vehicle, preferably a nucleic acid molecule, which can 
transport the nucleic acid molecules. When the vector is a 
nucleic acid molecule, the nucleic acid molecules are 
covalently linked to the vector nucleic acid. With this aspect 
of the invention, the vector includes a plasmid, single or 
double stranded phage, a single or double stranded RNA or
DNA viral vector, or artificial chromosome, such as a BAC,
PAC, YAC, or MAC.

A vector can be maintained in the host cell as an extra- 
chromosomal element where it replicates and produces 
additional copies of the nucleic acid molecules. 
Alternatively, the vector may integrate into the host cell 
genome and produce additional copies of the nucleic acid 
molecules when the host cell replicates. 

The invention provides vectors for the maintenance 
(cloning vectors) or vectors for expression (expression 
vectors) of the nucleic acid molecules. The vectors can 
function in prokaryotic or eukaryotic cells or in both (shuttle 
vectors). 

Expression vectors contain cis-acting regulatory regions 
that are operably linked in the vector to the nucleic acid 
molecules such that transcription of the nucleic acid mol- 
ecules is allowed in a host cell. The nucleic acid molecules 
can be introduced into the host cell with a separate nucleic 
acid molecule capable of affecting transcription. Thus, the 
second nucleic acid molecule may provide a trans-acting 
factor interacting with the cis-regulatory control region to 
allow transcription of the nucleic acid molecules from the 
vector. Alternatively, a trans-acting factor may be supplied 
by the host cell. Finally, a trans-acting factor can be
produced from the vector itself. It is understood, however, that
in some embodiments, transcription and/or translation of the 
nucleic acid molecules can occur in a cell-free system. 

The regulatory sequences to which the nucleic acid
molecules described herein can be operably linked include
promoters for directing mRNA transcription. These include,
but are not limited to, the left promoter from bacteriophage
λ, the lac, TRP, and TAC promoters from E. coli, the early
and late promoters from SV40, the CMV immediate early 
promoter, the adenovirus early and late promoters, and 
retrovirus long-terminal repeats. 

In addition to control regions that promote transcription, 
expression vectors may also include regions that modulate 
transcription, such as repressor binding sites and enhancers. 
Examples include the SV40 enhancer, the cytomegalovirus 
immediate early enhancer, polyoma enhancer, adenovirus 
enhancers, and retrovirus LTR enhancers. 

In addition to containing sites for transcription initiation 
and control, expression vectors can also contain sequences 
necessary for transcription termination and, in the
transcribed region, a ribosome binding site for translation. Other
regulatory control elements for expression include initiation 
and termination codons as well as polyadenylation signals. 
The person of ordinary skill in the art would be aware of the 
numerous regulatory sequences that are useful in expression 
vectors. Such regulatory sequences are described, for 
example, in Sambrook et al., Molecular Cloning: A Labo- 
ratory Manual. 2nd. ed., Cold Spring Harbor Laboratory 
Press, Cold Spring Harbor, N.Y., (1989). 

A variety of expression vectors can be used to express a 
nucleic acid molecule. Such vectors include chromosomal, 
episomal, and virus-derived vectors, for example vectors 
derived from bacterial plasmids, from bacteriophage, from 
yeast episomes, from yeast chromosomal elements, includ- 
ing yeast artificial chromosomes, from viruses such as 
baculoviruses, papovaviruses such as SV40, Vaccinia 
viruses, adenoviruses, poxviruses, pseudorabies viruses, and
retroviruses. Vectors may also be derived from combinations
of these sources such as those derived from plasmid and
bacteriophage genetic elements, e.g. cosmids and
phagemids. Appropriate cloning and expression vectors for
prokaryotic and eukaryotic hosts are described in Sambrook
et al., Molecular Cloning: A Laboratory Manual. 2nd. ed.,
Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 
N.Y., (1989). 

The regulatory sequence may provide constitutive
expression in one or more host cells (i.e. tissue specific) or may
provide for inducible expression in one or more cell types 
such as by temperature, nutrient additive, or exogenous 
factor such as a hormone or other ligand. A variety of vectors 
providing for constitutive and inducible expression in 
prokaryotic and eukaryotic hosts are well known to those of
ordinary skill in the art. 

The nucleic acid molecules can be inserted into the vector 
nucleic acid by well-known methodology. Generally, the 
DNA sequence that will ultimately be expressed is joined to 
an expression vector by cleaving the DNA sequence and the 
expression vector with one or more restriction enzymes and
then ligating the fragments together. Procedures for restric- 
tion enzyme digestion and ligation are well known to those 
of ordinary skill in the art. 
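
The cut-and-ligate step can be pictured with a short sketch. It is an illustration only, not a protocol from this specification: the function names `find_sites` and `digest` are hypothetical, the example assumes the BamHI recognition sequence GGATCC (cut after the first G), and the test fragment is a stretch of SEQ ID NO: 1 that happens to contain that sequence.

```python
# Illustrative sketch: locate a restriction site and split a linear DNA
# strand at the cut position. Names and the choice of enzyme are assumptions.

def find_sites(seq: str, site: str) -> list[int]:
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq, site = seq.lower(), site.lower()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq: str, site: str, cut_offset: int) -> list[str]:
    """Split `seq` at each site; `cut_offset` is the cut position inside the site."""
    fragments, previous = [], 0
    for pos in find_sites(seq, site):
        cut = pos + cut_offset
        fragments.append(seq[previous:cut])
        previous = cut
    fragments.append(seq[previous:])
    return fragments


# Stretch of SEQ ID NO: 1 containing a single GGATCC site.
fragment = "gcgcagtatggatccgttcccc"
print(find_sites(fragment, "GGATCC"))
print(digest(fragment, "GGATCC", cut_offset=1))
```

Only one strand is modeled here; a real digest leaves staggered ends on both strands, which is what allows the insert and vector fragments to be ligated.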

The vector containing the appropriate nucleic acid
molecule can be introduced into an appropriate host cell for
propagation or expression using well-known techniques. 
Bacterial cells include, but are not limited to, E. coli, 
Streptomyces, and Salmonella typhimurium. Eukaryotic 
cells include, but are not limited to, yeast, insect cells such 
as Drosophila, animal cells such as COS and CHO cells, and
plant cells. 

As described herein, it may be desirable to express the 
peptide as a fusion protein. Accordingly, the invention 
provides fusion vectors that allow for the production of the 
peptides. Fusion vectors can increase the expression of a 
recombinant protein, increase the solubility of the recombi- 
nant protein, and aid in the purification of the protein by 
acting for example as a ligand for affinity purification. A 
proteolytic cleavage site may be introduced at the junction 
of the fusion moiety so that the desired peptide can
ultimately be separated from the fusion moiety. Proteolytic
enzymes include, but are not limited to, factor Xa, thrombin,
and enterokinase. Typical fusion expression vectors include
pGEX (Smith et al., Gene 67:31-40 (1988)), pMAL (New
England Biolabs, Beverly, Mass.) and pRIT5 (Pharmacia,
Piscataway, N.J.) which fuse glutathione S-transferase
(GST), maltose E binding protein, or protein A, respectively, 
to the target recombinant protein. Examples of suitable 
inducible non-fusion E. coli expression vectors include pTrc 
(Amann et al., Gene 69:301-315 (1988)) and pET 11d
(Studier et al., Gene Expression Technology: Methods in
Enzymology 185:60-89 (1990)). 

Recombinant protein expression can be maximized in 
host bacteria by providing a genetic background wherein the 
host cell has an impaired capacity to proteolytically cleave 
the recombinant protein. (Gottesman, S., Gene Expression
Technology: Methods in Enzymology 185, Academic Press, 
San Diego, Calif. (1990) 119-128). Alternatively, the 
sequence of the nucleic acid molecule of interest can be 
altered to provide preferential codon usage for a specific 
host cell, for example E. coli. (Wada et al., Nucleic Acids 
Res. 20:2111-2118 (1992)).
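
As a rough sketch of what altering a sequence for preferential codon usage involves, the following replaces codons with synonymous ones and checks that the encoded peptide is unchanged. The function names and the `preferred` mapping are hypothetical placeholders chosen for illustration; they are not codon-usage data from Wada et al. or from this specification. The codon table is the standard genetic code, and the example sequence is the first seven codons of the open reading frame in SEQ ID NO: 1.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(cds: str) -> list[str]:
    """Split a coding sequence into complete codons (trailing bases are dropped)."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in codons(cds))


def recode(cds: str, preferred: dict[str, str]) -> str:
    """Swap each codon for the preferred synonymous codon, where one is given."""
    out = []
    for codon in codons(cds):
        aa = CODON_TABLE[codon]
        new = preferred.get(aa, codon)
        if CODON_TABLE[new] != aa:
            raise ValueError(f"{new} does not encode {aa}")
        out.append(new)
    return "".join(out)


# Hypothetical preference map (one-letter amino acid -> codon), for illustration.
preferred = {"Q": "CAG", "R": "CGT"}
original = "ATGGTACAGGACTGTCAACGA"  # Met Val Gln Asp Cys Gln Arg
recoded = recode(original, preferred)
print(recoded, translate(recoded))
```

The check inside `recode` is the point of the design: a substitution is accepted only if it leaves the translated peptide identical.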

The nucleic acid molecules can also be expressed by 
expression vectors that are operative in yeast. Examples of 
vectors for expression in yeast, e.g., S. cerevisiae, include
pYepSec1 (Baldari, et al., EMBO J. 6:229-234 (1987)),
pMFa (Kurjan et al., Cell 30:933-943 (1982)), pJRY88
(Schultz et al., Gene 54:113-123 (1987)), and pYES2 
(Invitrogen Corporation, San Diego, Calif.). 

The nucleic acid molecules can also be expressed in insect
cells using, for example, baculovirus expression vectors.
Baculovirus vectors available for expression of proteins in
cultured insect cells (e.g., Sf9 cells) include the pAc series
(Smith et al., Mol. Cell Biol. 3:2156-2165 (1983)) and the
pVL series (Lucklow et al., Virology 170:31-39 (1989)).

In certain embodiments of the invention, the nucleic acid 
molecules described herein are expressed in mammalian 
cells using mammalian expression vectors. Examples of 
mammalian expression vectors include pCDM8 (Seed, B.
Nature 329:840 (1987)) and pMT2PC (Kaufman et al.,
EMBO J. 6:187-195 (1987)).

The expression vectors listed herein are provided by way 
of example only of the well-known vectors available to 
those of ordinary skill in the art that would be useful to 
express the nucleic acid molecules. The person of ordinary 
skill in the art would be aware of other vectors suitable for 
maintenance, propagation, or expression of the nucleic acid
molecules described herein. These are found for example in
Sambrook, J., Fritsch, E. F., and Maniatis, T. Molecular
Cloning: A Laboratory Manual. 2nd. ed., Cold Spring
Harbor Laboratory, Cold Spring Harbor Laboratory Press, 
Cold Spring Harbor, N.Y., 1989. 

The invention also encompasses vectors in which the 
nucleic acid sequences described herein are cloned into the 
vector in reverse orientation, but operably linked to a 
regulatory sequence that permits transcription of antisense
RNA. Thus, an antisense transcript can be produced to all, 
or to a portion, of the nucleic acid molecule sequences 
described herein, including both coding and non-coding 
regions. Expression of this antisense RNA is subject to each 
of the parameters described above in relation to expression 
of the sense RNA (regulatory sequences, constitutive or 
inducible expression, tissue-specific expression). 
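
The relationship between a sense-strand insert and the antisense transcript produced from a reverse-orientation clone can be sketched as a reverse complement. This is an illustration under stated assumptions, not part of the disclosed methods: the function name `antisense_rna` is hypothetical, and the example input is a ten-base fragment taken from SEQ ID NO: 1.

```python
# Hypothetical helper: derive the antisense RNA that would be transcribed
# from a DNA insert cloned in reverse orientation.
_COMPLEMENT = str.maketrans("acgtACGT", "ugcaUGCA")


def antisense_rna(sense_dna: str) -> str:
    """Return the antisense RNA (5'->3') complementary to a sense DNA strand."""
    cleaned = "".join(sense_dna.split())  # drop spaces and line breaks
    if set(cleaned.lower()) - set("acgt"):
        raise ValueError("sequence contains characters other than a, c, g, t")
    # Complement each base (writing u in place of t), then reverse.
    return cleaned.translate(_COMPLEMENT)[::-1]


print(antisense_rna("atggtacagg"))
```

Because the helper accepts any stretch of sequence, it applies equally to coding and non-coding regions, in keeping with the paragraph above.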

The invention also relates to recombinant host cells 
containing the vectors described herein. Host cells therefore 
include prokaryotic cells, lower eukaryotic cells such as 
yeast, other eukaryotic cells such as insect cells, and higher 
eukaryotic cells such as mammalian cells. 

The recombinant host cells are prepared by introducing 
the vector constructs described herein into the cells by 
techniques readily available to the person of ordinary skill in 
the art. These include, but are not limited to, calcium 
phosphate transfection, DEAE-dextran-mediated 
transfection, cationic lipid-mediated transfection, 
electroporation, transduction, infection, lipofection, and 
other techniques such as those found in Sambrook, et al. 
(Molecular Cloning: A Laboratory Manual. 2nd. ed., Cold
Spring Harbor Laboratory, Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, N.Y., 1989).

Host cells can contain more than one vector. Thus, dif- 
ferent nucleotide sequences can be introduced on different 
vectors of the same cell. Similarly, the nucleic acid mol- 
ecules can be introduced either alone or with other nucleic 
acid molecules that are not related to the nucleic acid 
molecules such as those providing trans-acting factors for 
expression vectors. When more than one vector is intro- 
duced into a cell, the vectors can be introduced 
independently, co-introduced or joined to the nucleic acid 
molecule vector. 

In the case of bacteriophage and viral vectors, these can 
be introduced into cells as packaged or encapsulated virus 
by standard procedures for infection and transduction. Viral 
vectors can be replication-competent or replication- 
defective. In the case in which viral replication is defective, 
replication will occur in host cells providing functions that 
complement the defects. 

Vectors generally include selectable markers that enable 
the selection of the subpopulation of cells that contain the
recombinant vector constructs. The marker can be contained
in the same vector that contains the nucleic acid molecules 
described herein or may be on a separate vector. Markers 
include tetracycline or ampicillin-resistance genes for 
prokaryotic host cells and dihydrofolate reductase or
neomycin resistance for eukaryotic host cells. However, any
marker that provides selection for a phenotypic trait will be 
effective. 

While the mature proteins can be produced in bacteria, 
yeast, mammalian cells, and other cells under the control of
the appropriate regulatory sequences, cell-free transcription 
and translation systems can also be used to produce these 
proteins using RNA derived from the DNA constructs 
described herein. 

Where secretion of the peptide is desired, which is
difficult to achieve with multi-transmembrane domain
containing proteins such as kinases, appropriate secretion signals
are incorporated into the vector. The signal sequence can be 
endogenous to the peptides or heterologous to these pep- 
tides. 

Where the peptide is not secreted into the medium, which 
is typically the case with kinases, the protein can be isolated 
from the host cell by standard disruption procedures, includ- 
ing freeze thaw, sonication, mechanical disruption, use of 
lysing agents and the like. The peptide can then be recovered
and purified by well-known purification methods including
ammonium sulfate precipitation, acid extraction, anion or
cationic exchange chromatography, phosphocellulose
chromatography, hydrophobic-interaction chromatography,
affinity chromatography, hydroxylapatite chromatography,
lectin chromatography, or high performance liquid
chromatography.

It is also understood that depending upon the host cell in 
recombinant production of the peptides described herein, the 
peptides can have various glycosylation patterns, depending 
upon the cell, or may be non-glycosylated as when produced
in bacteria. In addition, the peptides may include an initial 
modified methionine in some cases as a result of a host- 
mediated process. 

Uses of Vectors and Host Cells



The recombinant host cells expressing the peptides 
described herein have a variety of uses. First, the cells are 
useful for producing a kinase protein or peptide that can be 
further purified to produce desired amounts of kinase protein 
or fragments. Thus, host cells containing expression vectors
are useful for peptide production. 

Host cells are also useful for conducting cell-based assays 
involving the kinase protein or kinase protein fragments, 
such as those described above as well as other formats 
known in the art. Thus, a recombinant host cell expressing
a native kinase protein is useful for assaying compounds that 
stimulate or inhibit kinase protein function. 

Host cells are also useful for identifying kinase protein 
mutants in which these functions are affected. If the mutants 
naturally occur and give rise to a pathology, host cells 
containing the mutations are useful to assay compounds that 
have a desired effect on the mutant kinase protein (for 
example, stimulating or inhibiting function) which may not 
be indicated by their effect on the native kinase protein. 

Genetically engineered host cells can be further used to
produce non-human transgenic animals. A transgenic animal
is preferably a mammal, for example a rodent, such as a rat 
or mouse, in which one or more of the cells of the animal 
include a transgene. A transgene is exogenous DNA which 
is integrated into the genome of a cell from which a
transgenic animal develops and which remains in the
genome of the mature animal in one or more cell types or 
tissues of the transgenic animal. These animals are useful for 
studying the function of a kinase protein and identifying and
evaluating modulators of kinase protein activity. Other 
examples of transgenic animals include non-human 
primates, sheep, dogs, cows, goats, chickens, and amphib- 
ians. 

A transgenic animal can be produced by introducing 
nucleic acid into the male pronuclei of a fertilized oocyte,
e.g., by microinjection or retroviral infection, and allowing the
oocyte to develop in a pseudopregnant female foster animal. 
Any of the kinase protein nucleotide sequences can be 
introduced as a transgene into the genome of a non-human 
animal, such as a mouse. 

Any of the regulatory or other sequences useful in expres- 
sion vectors can form part of the transgenic sequence. This 
includes intronic sequences and polyadenylation signals, if 
not already included. A tissue-specific regulatory
sequence(s) can be operably linked to the transgene to direct
expression of the kinase protein to particular cells.

Methods for generating transgenic animals via embryo 
manipulation and microinjection, particularly animals such 
as mice, have become conventional in the art and are 
described, for example, in U.S. Pat. Nos. 4,736,866 and 
4,870,009, both by Leder et al., U.S. Pat. No. 4,873,191 by
Wagner et al. and in Hogan, B., Manipulating the Mouse 
Embryo, (Cold Spring Harbor Laboratory Press, Cold Spring 
Harbor, N.Y., 1986). Similar methods are used for produc- 
tion of other transgenic animals. A transgenic founder ani- 
mal can be identified based upon the presence of the 
transgene in its genome and/or expression of transgenic 
mRNA in tissues or cells of the animals. A transgenic 
founder animal can then be used to breed additional animals 
carrying the transgene. Moreover, transgenic animals carry- 
ing a transgene can further be bred to other transgenic 
animals carrying other transgenes. A transgenic animal also 
includes animals in which the entire animal or tissues in the 
animal have been produced using the homologously recom- 
binant host cells described herein. 

In another embodiment, transgenic non-human animals 
can be produced which contain selected systems that allow 
for regulated expression of the transgene. One example of 
such a system is the cre/loxP recombinase system of
bacteriophage P1. For a description of the cre/loxP recombinase
system, see, e.g., Lakso et al. PNAS 89:6232-6236 (1992).
Another example of a recombinase system is the FLP
recombinase system of S. cerevisiae (O'Gorman et al.
Science 251:1351-1355 (1991)). If a cre/loxP recombinase
system is used to regulate expression of the transgene,
animals containing transgenes encoding both the Cre
recombinase and a selected protein are required. Such animals can
be provided through the construction of "double" transgenic 
animals, e.g., by mating two transgenic animals, one con- 
taining a transgene encoding a selected protein and the other 
containing a transgene encoding a recombinase. 

Clones of the non-human transgenic animals described 
herein can also be produced according to the methods 
described in Wilmut, I. et al. Nature 385:810-813 (1997) 
and PCT International Publication Nos. WO 97/07668 and 
WO 97/07669. In brief, a cell, e.g., a somatic cell, from the 
transgenic animal can be isolated and induced to exit the 
growth cycle and enter G0 phase. The quiescent cell can then
be fused, e.g., through the use of electrical pulses, to an 
enucleated oocyte from an animal of the same species from 
which the quiescent cell is isolated. The reconstructed 
oocyte is then cultured such that it develops to morula or 
blastocyst and then transferred to a pseudopregnant female
foster animal. The offspring born of this female foster 
animal will be a clone of the animal from which the cell, e.g., 
the somatic cell, is isolated. 

Transgenic animals containing recombinant cells that 
express the peptides described herein are useful to conduct 
the assays described herein in an in vivo context. 
Accordingly, the various physiological factors that are 
present in vivo and that could affect substrate binding,
kinase protein activation, and signal transduction, may not 
be evident from in vitro cell-free or cell-based assays. 
Accordingly, it is useful to provide non-human transgenic 
animals to assay in vivo kinase protein function, including 
substrate interaction, the effect of specific mutant kinase 
proteins on kinase protein function and substrate interaction, 
and the effect of chimeric kinase proteins. It is also possible 
to assess the effect of null mutations, that is, mutations that 
substantially or completely eliminate one or more kinase 
protein functions. 

All publications and patents mentioned in the above 
specification are herein incorporated by reference. Various 
modifications and variations of the described method and 
system of the invention will be apparent to those skilled in 
the art without departing from the scope and spirit of the 
invention. Although the invention has been described in 
connection with specific preferred embodiments, it should 
be understood that the invention as claimed should not be 
unduly limited to such specific embodiments. Indeed, vari- 
ous modifications of the above-described modes for carrying 
out the invention which are obvious to those skilled in the 
field of molecular biology or related fields are intended to be 
within the scope of the following claims. 



SEQUENCE LISTING 

<160> NUMBER OF SEQ ID NOS: 4 

<210> SEQ ID NO 1 
<211> LENGTH: 2320 
<212> TYPE: DNA 
<213> ORGANISM: Human 

<400> SEQUENCE: 1 

cccagggcgc cgtaggcggt gcatcccgtt cgcgcctggg gctgtggtct tcccgcgcct 60 

gaggcggcgg cggcaggagc tgaggggagt tgtagggaac tgaggggagc tgctgtgtcc 120 

cccgcctcct cctccccatt tccgcgctcc cgggaccatg tccgcgctgg cgggtgaaga 180 

tgtctggagg tgtccaggct gtggggacca cattgctcca agccagatat ggtacaggac 240 

tgtcaacgaa acctggcacg gctcttgctt ccggtgaaag tgatgcgcag cctggaccac 300 

cccaatgtgc tcaagttcat tggtgtgctg tacaaggata agaagctgaa cctgctgaca 360 

gagtacattg aggggggcac actgaaggac tttctgcgca gtatggatcc gttcccctgg 420 

cagcagaagg tcaggtttgc caaaggaatc gcctccggaa tggacaagac tgtggtggtg 480 

gcagactttg ggctgtcacg gctcatagtg gaagagagga aaagggcccc catggagaag 540 

gccaccacca agaaacgcac cttgcgcaag aacgaccgca agaagcgcta cacggtggtg 600 

ggaaacccct actggatggc ccctgagatg ctgaacggaa agagctatga tgagacggtg 660 

gatatcttct cctttgggat cgttctctgt gagatcattg ggcaggtgta tgcagatcct 720 

gactgccttc cccgaacact ggactttggc ctcaacgtga agcttttctg ggagaagttt 780 

gttcccacag attgtccccc ggccttcttc ccgctggccg ccatctgctg cagactggag 840 

cctgagagca gaccagcatt ctcgaaattg gaggactcct ttgaggccct ctccctgtac 900 

ctgggggagc tgggcatccc gctgcctgca gagctggagg agttggacca cactgtgagc 960 

atgcagtacg gcctgacccg ggactcacct ccctagccct ggcccagccc cctgcagggg 1020 

ggtgttctac agccagcatt gcccctctgt gccccattcc tgctgtgagc agggccgtcc 1080 

gggcttcctg tggattggcg gaatgtttag aagcagaaca aaccattcct attacctccc 1140 

caggaggcaa gtgggcgcag caccagggaa atgtatctcc acaggttctg gggcctagtt 1200 

actgtctgta aatccaatac ttgcctgaaa gctgtgaaga agaaaaaaac ccctggcctt 1260 

tgggccagga ggaatctgtt actcgaatcc acccaggaac tccctggcag tggattgtgg 1320 

gaggctcttg cttacactaa tcagcgtgac ctggacctgc tgggcaggat cccagggtga 1380 

acctgcctgt gaactctgaa gtcactagtc cagcrtgggtg caggaggact tcaagtgtgt 1440 

ggacgaaaga aagactgatg gctcaaaggg tgtgaaaaag tcagtgatgc tccccctttc 1500 

tactccagat cctgtccttc ctggagcaag gttgagggag taggttttga agagtccctt 1560 

aatatgtggt ggaacaggcc aggagttaga gaaagggctg gcttctgttt acctgctcac 1620 

tggctctagc cagcccaggg accacatcaa tgtgagagga agcctccacc tcatgttttc 1680 

aaacttaata ctggagactg gctgagaact tacggacaac atcctttctg tctgaaacaa 1740 

acagtcacaa gcacaggaag aggctggggg actagaaaga ggccctgccc tctagaaagc 1800 

tcagatcttg gcttctgtta ctcatactcg ggtgggctcc ttagtcagat gcctaaaaca 1860 

ttttgcctaa agctcgatgg gttctggagg acagtgtggc ttgtcacagg cctagagtct 1920 

gagggagggg agtgggagtc tcagcaatct cttggtcttg gcttcatggc aaccactgct 1980 

cacccttcaa catgcctggt ttaggcagca gcttgggctg ggaagaggtg gtggcagagt 2040 

ctcaaagctg agatgctgag agagatagct ccctgagctg ggccatctga cttctacctc 2100 

ccatgtttgc tctcccaact cattagctcc tgggcagcat cctcctgagc cacatgtgca 2160 

ggtactggaa aacctccatc ttggctccca gagctctagg aactcttcat cacaactaga 2220 

tttgcctctt ctaagtgtct atgagcttgc accatattta ataaattggg aatgggtttg 2280 

gggtattaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa 2320 

<210> SEQ ID NO 2 
<211> LENGTH: 255 
<212> TYPE: PRT 
<213> ORGANISM: Human 

<400> SEQUENCE: 2 

Met Val Gln Asp Cys Gln Arg Asn Leu Ala Arg Leu Leu Leu Pro Val 
1 5 10 15 

Lys Val Met Arg Ser Leu Asp His Pro Asn Val Leu Lys Phe Ile Gly 
20 25 30 

Val Leu Tyr Lys Asp Lys Lys Leu Asn Leu Leu Thr Glu Tyr Ile Glu 
35 40 45 

Gly Gly Thr Leu Lys Asp Phe Leu Arg Ser Met Asp Pro Phe Pro Trp 
50 55 60 

Gln Gln Lys Val Arg Phe Ala Lys Gly Ile Ala Ser Gly Met Asp Lys 
65 70 75 80 

Thr Val Val Val Ala Asp Phe Gly Leu Ser Arg Leu Ile Val Glu Glu 
85 90 95 

Arg Lys Arg Ala Pro Met Glu Lys Ala Thr Thr Lys Lys Arg Thr Leu 
100 105 110 

Arg Lys Asn Asp Arg Lys Lys Arg Tyr Thr Val Val Gly Asn Pro Tyr 
115 120 125 

Trp Met Ala Pro Glu Met Leu Asn Gly Lys Ser Tyr Asp Glu Thr Val 
130 135 140 

Asp Ile Phe Ser Phe Gly Ile Val Leu Cys Glu Ile Ile Gly Gln Val 
145 150 155 160 

Tyr Ala Asp Pro Asp Cys Leu Pro Arg Thr Leu Asp Phe Gly Leu Asn 
165 170 175 

Val Lys Leu Phe Trp Glu Lys Phe Val Pro Thr Asp Cys Pro Pro Ala 
180 185 190 

Phe Phe Pro Leu Ala Ala Ile Cys Cys Arg Leu Glu Pro Glu Ser Arg 
195 200 205 

Pro Ala Phe Ser Lys Leu Glu Asp Ser Phe Glu Ala Leu Ser Leu Tyr 
210 215 220 

Leu Gly Glu Leu Gly Ile Pro Leu Pro Ala Glu Leu Glu Glu Leu Asp 
225 230 235 240 

His Thr Val Ser Met Gln Tyr Gly Leu Thr Arg Asp Ser Pro Pro 
245 250 255 



<210> SEQ ID NO 3 
<211> LENGTH: 59065 
<212> TYPE: DNA 
<213> ORGANISM: Human 

<400> SEQUENCE: 3 



tcatccttgc gcaggggcca tgctaacctt ctgtgtctca gtccaatttt aatgtatgtg 60 

ctgctgaagc gagagtacca gaggtttttt tgatggcagt gacttgaact tatttaaaag 120 

ataaggagga gccagtgagg gagaggggtg ctgtaaagat aactaaaagt gcacttcttc 180 

taagaagtaa gatggaatgg gatccagaac aggggtgtca taccgagtag cccagccttt 240 

gttccgtgga cactggggag tctaacccag agctgagata gcttgcagtg tggatgagcc 300 

agctgagtac agcagatagg gaaaagaagc caaaaatctg aagtagggct ggggtgaagg 360 

acagggaagg gctagagaga catttggaaa gtgaaaccag gtggatatga gaggagagag 420 

tagagggtct tgatttcggg tctttcatgc ttaacccaaa gcaggtacta aagtatgtgt 480 

tgattgaatg tctttgggtt tctcaagact ggagaaagca gggcaagctc tggagggtat 540 

ggcaataaca agttatcttg aatatcctca tggtggaaag tcctgatcct gtttgaattt 600 

tggaaataga aatcattcag agccaagaga ttgaattgtt gagtaagtgg gtggtcaggt 660 

tacagactta attttgggtt aaaaagtaaa aacaagaaac aaggtgtggc tctaaaataa 720 

tgagatgtgc tgggggtggg gcatggcagc tcataaactg accctgaaag ctcttacatg 780 

taagagttcc aaaaatattt ccaaaacttg gaagattcat ttggatgttt gtgttcatta 840 

aaatctctca ctaattcatt gtcttgtcca ctgtccgtaa cccaacctgg gattggtttg 900 

agtgagtctc tcagactttc tgccttggag tttgtgagag agatggcata ctctgtgacc 960 

actgtcaccc taaaaccaaa aaggcccctc ttgacaagga gtctgaggat tttagaccca 1020 

ggaagaatga gtgatgggca tatatatatc ctattactga ggcatgagaa gagtggaatg 1080 

ggtgggttga ggtggtgttt taaggcctct tgccagcttg tttaactctt ctctggggaa 1140 

cgagggggac aactgtgtac attggctgct ccagaatgat gttgagcaat cttgaagtgc 1200 

caggagctgt gctttgtcta ttcatggccc ctgtgcctgt gaaacagggt tcggtgactg 1260 

tcactgtgcc tgtggcagtc tgtagttacc cagagagaac aaagctgcat acacagagcg 1320 

cacaagggag tcttgtaaca accttgtcct gctttctagg gctgagtcag gtaccacagc 1380 

ttgatctcag ctgtcctctt tatttcaaga agttgacatc tgagccatac caggagtatt 1440 

gtattttgtt tgaggcctct ctttttggag gaacatggac cgactctgtg cttttgtcta 1500 

tgctggtctc tgagctcaca caacccttca ccctcctttc tcagccagtg ataggtaagt 1560 

cttccctatc ttgcaaggct cagctcaagt gtcagcttcc tctacaaaga ctttcctggt 1620 

tcccctcatt ggagtgaaca agagttgaca tggtagaatg gaaagagcag aagctttaga 1680 

atgagccaga cctgagtatg aatgctagat ccaccactta gctagtcaac cctgccccct 1740 

gcctcaagtt ttaattttcc tatccattaa gtgaatataa taatacctgt gtcacaggat 1800 

tattttgaga attaaatgag attaggtcta tgaaagcacc tagcagagtt cttggcatat 1860 

aggaggcatt cattaaatat ttgttcttcc ccttttatac ccattacttt tctttttctg 1920 

aactaaaata atacttggtt ctatctctga aataacatcc aagtgaaaaa tcaacaacat 1980 

gaaagagcag ttcttttcca gtggatttgc ttcttaagga gcagagatta tgtaatctaa 2040 

cagcctccaa catacaaaga gctttgtatc tagaacaggg gtccccagcc cctggaccgc 2100 

caactggtac gggtctgtag cctgttagga accaggctgc acagcaggag gtgagcggcg 2160 

ggccagtgag cattgctgcc tgagctctgc ctcctgtcag atcagtggtg gcattagatt 2220 

ctcataggag tgtgaaccct attgtgaact gcacatgcaa gggatctggg ttgcatgctc 2280 

cttatgagaa tctcactaat ggctgatgat ctgagttgga acagtttgat accaaaacca 2340 

tccccccgcc ccccaacccc cagcctaggg tccgtggaaa aattggcccc tggtgccaaa 2400 

aaggttgagg actgctgatc tagaggacca atttattcaa tgttggttga gtaaatgagc 2460 

tcttggatta ggtgatggaa aaatctgaaa aaacagggct tttgaggaat aggaaaaggc 2520 

agtaacatgt ttaacccaga gagaagtttc tggctgttgg ctgggaatag tcataggaag 2580 

ggctgacact gaaaagaagg agattgtgtt cgtttcttct tctcagagct ataagcaaag 2640 

gctgaaagtt ctagaaaaag gcaagttttg tttcagtaga aaaaaggata atcagaacca 2700 

tttttagaaa atggaatgag actacttttg aggccatgag ttccttgtcc ctggagagat 2760 

gagcagaggt tggacaagtg cttaccagag atcttgtgga ggcagaaact gtgcatctag 2820 

cagagcattg gcctaaccct ttcaaatgag atgctgttaa ctcagtctta ttctacatgg 2880 

taggaatcct gtccctttgc ctcctgctac tttgggcctc tcaacctctt ggttttgtgt 2940 

gcaggtgaag atgtctggag gtgtccaggc tgtggggacc acattgctcc aagccagata 3000 

tggtacagga ctgtcaacga aacctggcac ggctcttgct tccggtaggt gggcctatcc 3060 

tcccatcttt accagtgtac tatgggccaa gcactatttc atgttctgat ggaaaacaca 3120
gaaacaagct tctgagttga gaatttcaat cttagggtgg ggaaaggaat gtaccaagga 3180

agagctcatg accaaacctc aagtgtggcc cccctgaacc caggttaaat tggaagagcc 3240 

ataaatgggc cagctggagg cagggtgggg ggatgagagg agccctttcc agggttgtcc 3300 

catatccctc actttatggg tgaggaaact gaggcccagg aagagtgact ttcctgtggc 3360 

tgcactacag attatgcagg tacttcaaga gttgtttgta ttcttatttt attttatttt 3420 

ottttatttt attttatttt attttatgag agggattctt gctgttgccc aggctggagt 3480 

gcagtggtgc aatctcggct cactgcaatc tctgcctgct gggttcaagt gatttttctg 3540 

ccttagcttc ctgagtagct gagatgacag gcacctgcca ccatgcgcag ctaatttttg 3600 

tattttagtg gagacggggg tttcaacatg ttggtcaggc tggtcttgaa ctcctgacct 3660 

caaatgatgc acccacctcg acctcccaaa gtgctggaat tacaggcgtg aaccactgtg 3720 

cccagccaag agttgttttt agtgtggttg gcagagccag ctcttccttc accacaggat 3780 

gcctccctag gttcctactt tttgttacta gcttttatta tagctatatt attattatta 3840 

ttattattat tattattatt attattgaga cagagtctcg ctctgtcgcc caggctggtg 3900 

tacagtggtg cgatcccggg ctcactgcaa cctctgcctc ccgagttcaa gcagttctcc 3960 

tgcctcagcc ccccgagtag gtgggactac aggcgcctgc caccacaccc ggctaatttt 4020 

tgtattttta gtagagacgg ggtttcacct tgttgaccag gctggtctgg agctcctgac 4080 

ctcaggtaag tgctagaatc acaggcgtga accactgcgc ccagccaaga gttgttttta 4140 

gtgtggttgg cagagccagc tcttcctcac cacaggttgc ctccctaggt tcctactttt 4200 

tgttactagc tttattatag ctacattatt attattattg ttattattat tgagacagag 4260 

tctcgctctg tcgcccaggc tggtgtacag tgatgtgatc ttggctcact gcaacctctg 4320 

ccccccgagt tcaagcaatt ctcctgcttc agccccccta gtaggtggga ctccaggcac 4380 

ctgccaccac gcccagctaa tttttgtatt tttagtagag gcggggtttc accttgttgg 4440

ccaggctggt ctcaaactcc tgacctcagg tgatccgcct gcctcggcct cccaaaatgt 4500 

tgggattaca ggcatgagcc accgcgccct gcctatagct acattatttt tgtaggcagc 4560 

tcagtttctt aaaaattata cagacttcaa atcagatttg ttcctgctgt ctgaggctca 4620 

gtttcttcat ctggaaaatg gatggtaata atcttgttga gattgaatga aataatatat 4680 

gcagtgtatc cagtacatgg tagacaccca gtgaatggtt attccttcct cccatcggat 4740 

tggaattctc aagggtggga acttgtcttt atattcttca caacgtaaaa tagttgaaat 4800 

ttgttggtgg aaagaagagc agtccactcc agaggctgga tgggcatgcc tggcccccaa 4860 

ggtctgaagt ggtagggctg tgcctatatc ctgagaatga gatagactag gcaggcacct 4920 

tgtgctgtag attccagctc ctgcacatag ctcttgttgt aaaacatccc tgtgcttata 4980

ccaagtaatt gagttgacct ttaaacactt gcctcttccc tgggaaccat ataggggatt 5040 

ggcctggaga cgtctggcct ctggaagagt tggaaagcag ccatcattat tatcctttcc 5100 

tttcagctat aactcagagc tctcaagtct tttctgtgga tcttattgcc ttggttcttg 5160 

ccccttttac tcccagggaa gttgattctg tcttttctgt tccatttagt atgacaggag 5220 

cagagaatgt cagagctgta agggacctta tagttaaagc ctttggctgg tcctttcatt 5280 

ttatagctgg gactaataag taacgtcaaa acccaatgag ttcacagatt gggtctcgcc 5340 

ttggcatgta acccatatgt tcatattctt gctgttttcc tatgtgtatg aatattttct 5400 

atccaaaata agcaggacag ggtagagcaa gttaatcttt ggaatttctg gattctctta 5460
gagctaaaaa acttcagaac tagaagaaac cacccactat atggtataac ccattcatat 5520
cacagatgag gcctgaaacc aaaaagactt gctcaggcca tggatgacaa gagctggccc 5580 

tagcactgaa ctcttgggtc atttgtaggt ctagtcagat gctagcttgt tagctctgtg 5640 

cgtgcgtgtg tgtgtgtgtg tgtgtgtgtg tgtgtgagat agagacagaa agataacata 5700 

tgtacacaaa tacataaaga ggaagtagac acgttagcat ggtagataag agtacaggca 5760 

ggccaggcgt ggtggctcac gcctgtaatc ccagcacttt gggaggccaa ggcaggtgga 5820 

tcacctgagg tcaggaattc gagaccagcc tgaccaacat ggtgaaaccc catctctact 5880 

aaatacagaa aaaaattagc ttggcatggt ggcacatgcc tgtaatccca gctacttggg 5940 

aagctgaagc aggagaatcg cttgaatccg ggaagcagaa gttgcagtga gccgagattg 6000 

tgccattaca gtctagcctg ggcaacaaga gggaaactcc atcgcaaaaa aacaaccacc 6060 

accaagagta caggctatgg aatgagacta tggttttaaa tcctggcttt gcaatttatt 6120 

aactagcctt aagtgacttc cctgagcttc aggcaccaat ctgtaaaatg aggataagaa 6180 

tattactcat gccacatggt tgttagggag gattaaatgt gataacctat ataaagtggc 6240 

tagcatagca tctgacatat agaaaactct taatagggcc ggacgtggtg gcttatgcct 6300 

gtaatcctag cactctggga ggccgaggca gaaggatcgc ttgagcccat gagcccagga 6360 

gtttgagacc agcctggcca acatggcaaa actccacctc tacaaaaaat acaaaaatat 6420 

tagccaggcg tgatggcaca cacctgtagt cccagctact tgggaagctg aggagcgatg 6480

attacctgag cccagggata tcaaggctgt agtgagctgt gatcatgcca ctgtactcca 6540 

tccagctggg ggacagagtg aaacccctgt ctcaaaacaa aacaaatgaa aaaaaaaacc 6600

cttaataatc agtaactgtc actttatatt atgttgtgag tgtgtgtcta tatacaccta 6660 

tatgtataca tttctcttat tacacattca ttggtgatct gatgtggagc cccagggatt 6720 

aagggcaact ttgaactacc ctgacacaat caagccaaat atcattcccg tggaggaagt 6780 

agagtatcta ggttctgtct cctagttgca gctttacctt gaggacagag actctaatcc 6840 

agctgtgctg aaggagcaca tctcctgact tctgagcttt cccctggtaa attcaaactg 6900 

gatgtcacgg cgccctcaga tagagcctgg taatttgccc tggggagagt gactgtcttt 6960 

tggatctaat ttgacttttg ccccagttgg aggaaaatct tcagggctag gaaggattgt 7020 

atttgtctga ccccagagat aacctgggtt ttgaggaaca tggggcatca acctgaatgg 7080 

tcttgtaaga tctctcccac gccagcttgc cagtgtttct ctgatgaatt tagagtacct 7140 

gagtagtgca ggcctgctgg gaggaggact ctccctctgt gctactcaga gaaattcatt 7200 

cttcaaggcc cccttccagc cttgctctta cccagctggg ctacagttac aataaaggaa 7260 

atgacttttc ttctcccctt cccccagtac ctttgttttc ctagtcacag ggtggggctg 7320 

gatattgaat ggagaaattg ctggggtcca tcctaaactc ctcccctcat ctctccctta 7380 

cattacccca ttcttctgtc tgcagccaca tccataatcc tgcctctgtt agccttccga 7440 

cagaccctca ggtgcccagg acaacaggaa gctacttaaa gctggaacct cagactgtgc 7500 

aatggaggcc agtgacaaaa ctgaaagtag ctctgtcagt aattgtgctg gtgcgattag 7560 

gcagctggcc agaatctttt ggatctcctg gacatatggc tgactagtcc tcccaagcct 7620 

tcccaacagg cctctttttt ttcctttttt tcttttcttt tttttctttc tttctttctt 7680 

tctttttttt ttttttttag gctagtgaag tgaaattgtg ggagtggaaa aggaacaaag 7740 

aaatcggtaa ctggtagtga tcaattactt gtaaacacta ttgtacttgg accagcccag 7800 

taggcctttt ttaaaactct gagttacctc tctttccttt ccttgagcag tgccattaat 7860

tctgtatctg gggcaatcct ttctgatgtt ctctggacct ggctctctct ccttaggaga 7920

ggccaggaga gtagccagag agcatgtcat ttgtagctga ggttaaagtg tggagctatc 7980

aatggtgacc tggcctcttg gcatgttagc aagccagagg accttgacaa cttttttgat 8040

gattgtccgt tcaccctgat caaaggtgtt tggcttagga ggagggaaga aaagctaccc 8100

ctattagtct tgatggcccc agcgtgggtc tctattgctt gacctggttc ctagcagcat 8160

tatcagaagg aaaatccacc gctcttaagg ctcctgggaa ctttcaggac ttcctttctc 8220

aggattgcaa acataagact atttgagctt tcacttttga aaagcggtta ctaataccta 8280

tactctggga aagggctaat gcagatagaa gactgtggtc actgcatcag gcaacagacc 8340

atttccgcta aatttagtga ctccaggaag gccagtgaag aaataacaca cgtagcaacc 8400

agagactgtg ttgtaatatg ttggctgaca gcagggtact ttctgtgatg ctgaaagcca 8460

cattcatttt ctctcccctc atccccatct aagcaagcct ggtagaatca taattacagt 8520

aataggtacc acttattgag tactctgtgc cagacaccct cctgagcata cgacatgcat 8580

agcacattta atccttacaa tgacttaata aaatgtagta ctagtcttac ctacttcgag 8640

aatagggaaa tggaggttac ttgtttaaag tcacagagct aataggtagc atagctgaga 8700

tttgaactca ggcattctta ctccttgcct gcaagagtct cttggcattc ttgaatgcaa 8760

gcatatttct taacctcact gaggctcagt ttcctcttat ataatatggg gtaaagagcc 8820

ctcaccctgc ctgccacaca ctggtagtgt cagataacat tgaagggtgt tagtttaaag 8880

gcttcatgga ctctataatg tcaacaaaag tgctgttaac tttcttctgg gtctcaggct 8940

cctgatgtag agtcagtgga gcaaccctgc catctgctgt tatgctgttg atgttgctgc 9000

cacacttact aacctaaacc tttgattctg gctgtggcct tctccagaag gtgtttactc 9060

atttgtccag tttatctttt aggaaacagc cagcccgtag atcattaagg ctggctattg 9120

gacagggggc tggggcctgc ctgacagagg aaggaagggc agacatctgg ttcttcctct 9180

gcccctacaa gagactccag cctgaccaca gagtggtact cctaggatgt agcagcagca 9240

tatgagcttg aatgtgcctt aatcctgctc tttactttga gaagagagaa ctaaggaccc 9300

acagatgttt cacagcttct ataggaggca gaggtagaaa aatggagaga gatgaggcca 9360

gagatagata actgatatta attaaacgtt gtattaagaa cctcacttag attatctgat 9420

tcaatcttca taataaccct gcaaccccca cctttttttg agaacagggt cttgctctgt 9480

tgtccaggct acagtgcact ggtacaatca tagttcactg cagtgtcaac ctcctgagct 9540

caagcaatcc tcccacctca gccttgcaag cagcttggac tacaggcgtg ccaccacacc 9600

ttgccatttt tttttatttt aagtagaaac aaggtcttat taatactatg ttgcccaggc 9660

tggtcttgaa ctccagcgat cctcctgccc cagcctccca aagtgcttgg gattacggaa 9720

gtaagccact gtgcctggcc agtgcaaccc ccattttata ctaaaacagg aaggcccaga 9780

aaggtttgga gtaacttgtc cagggtcaca cagatgatat ttgaactcag gtctccctgg 9840

ctcccaagag agtctgcttt ccactaggac tcccaggaga aaaaaaaaaa aaaaaacagt 9900

agacttggag acagaaaatc tgatttgagt cttagttgag ctaggctaac tgtgtaactg 9960

tgggcaagtt ccttagcccc tgtgagcctc agtttcttat ctgtaaaatg tcataaaaga 10020

aatccatctc atggagtcgt tgtgatgatc aaggactctg aaaacattag aatggtttaa 10080

tgtgaaggat tagcagcagc acatggcaac attgtgcatc ttatattaac tatccaaata 10140

tatcaagcgt catttgctat atataaaagt catcaaatta ggcactgtgg gggatacgga 10200

gttggcatac tagcctggcc tcttaattaa ttcattaatt agcttattta tttttgagat 10260

aggtcttgct ctattgccca ggctggagtg cagtggcatg atgatagctt actatagcct 10320 

caatctccca ggcttaaaca atcctcctga gtagctggga ctacaggcac acactaccat 10380 

gcccagctaa ttttttttta attttttgta gagacagggt cttgctctgt tgcccaggct 10440 

ggtctcaaac tcctgggctc gagatcctcc cacctgggcc tcacaaagtg ttgggattac 10500 

aggtatgagc cacggcacct ggcctggtct cttaactggt tccctaagac agctggaaat 10560 

agagaatgtc atggagcatt cctaaccatg ggctccagcc tggctttcat tctgtttctc 10620 

ccctgaaaca acattccttt agtaatattc cgaataacag cttcatcagt ctgtctaccg 10680 

accactcttc aggcttcatc ttatatgacc tcccaaactg cactaagggt tgtattagag 10740 

aaaagtggat aaagttcgga gtcaggctgc ttgagcttaa atgccagctt cacttaccag 10800

ccacctgacc atgagtcagc tgcttaacca ttctttgcca cagtttcctt gtctatgaaa 10860 

agggaaatgg ctcccacctc aaaaagttgt taacattaaa ttcaatcatg tattcaaagt 10920 

cctgagcaga atgtctggcc atgactggga cttaacagat gttagcattt attattagta 10980 

tctgtcagtc ttgaaatgtt ctcttccctt ggctttcatg acattccaca ctctcctggt 11040 

tttctcttac ctctctggta atacctgttt gcttatcctt ctttgtccag ctctgggatg 11100 

ttaccattcc ttcaggcgtg ctgttttctc cttaggcagt cttacacaca ctcatgactt 11160 

ccttccattg tcctccacac actgatgacc ctaaaatcag tatctccagc ctaaaccttt 11220 

ccactgagtt ctagacccat atgttgtact atcaacctgg cttgtccatt tgaatgtctt 11280 

ccaggcactt cagactctct tctctagact ttgctggact ttcactcttc cccctaaaac 11340

tggctcctct tccactgaaa catgtatgtc attgagaggc accaccatcc acccagtgcc 11400 

taagccagaa acctaggaat ccttgatacc tgttctctct catcctgcat atccaagcct 11460 

atcagtttta tctctaaatt atattttggt aggtttactt ctttcctttt ctcccaccac 11520 

caccctgctc caagctacca tcatctcacc tggatgtctg caatagcctc atctcccaca 11580 

gccactctgc accccctaat ctgttctcta tagagcagtt ggaaggagtg atttttgttg 11640 

tttgttttgt tttgttttag acagagtctc actctgttcc ccaaggctgg agtgcagtgg 11700 

cacaatttcg gctcactgca acttctgcct cccgggttta agcaattctc ctgcctcagc 11760 

ctcccaagta gctgggatta aggcaccggc ccccataccc agctaatttt tatattttta 11820 

gtagagatgg ggttttgcca tgttggccaa gctagtctcg aactcctgac ctcaagtgat 11880 

ccacctgcct cggcctccca aagtgctggg attacaggtg tgagccactg cacctggctg 11940

gaaggagtga tcttaaaaaa aaaaaaaaca aaaaaaaact tgactgtgtc actctgtgtt 12000 

gtctctccta ccttgtatac ttccacaact tcccagtgtt cttggataaa gaccaaaatc 12060

cttaacttgg ccaggcgcgg tggctcacac ctatcatctc agcactttgg gaggccgagg 12120 

caggcagatc atgaagtcaa gagattgaga ccatcctggc caacatggtg aaaccccatc 12180 

tctactaaaa atacaaaaat tagctggtcg tggtggcgtg tgcctgtagt cccagctact 12240 

tgggaggctg aggcaggaga atcacttgaa cctgggaggc agaggttgca gtgagcccag 12300 

atcacgccac tgcactccag cctggtgaca gagtaagact ccatctcaaa aaaaaaaaaa 12360 

aaaaaaaaaa ttccttaatt tggcctacag tagagccctc cgtaatgtgg cctctctcca 12420 

catctccaca acctcctgct ccctgcactt cagcctcacc tctcttctgg acaggccctc 12480 

cttctgacaa gggctttgtt cattctgctc cctctgccta gaatgccccc ttactctgtt 12540 

cacttaactc ctgcttatcg tttagatctt tacctggatg gctcagagaa atatagaagt 12600
aattcctcac cctgaaaaat aggttaggtc cctgttttat gttttcatag acctttcctt 12660

tgaggctttt tttacaaaag tagttttaat ctcacattta ttcatgtgat catctcctta 12720

atgatatctt aagacctcta atagaacaat ttggtcatgg actgtggggt ttttgcccct 12780

cattgtgtca gcactgagca tattgttggc ataggaggga tatttgttga atgaattgct 12840 

agaggtggcc aagagatatg atgtaagtca ggcttttccc tgcccttccc cttccccttc 12900 

cccacatcct tcctatagca gccaccgtgg ctgcagttac tgtaaatggc aagacggaat 12960 

cagttccgga cattgggttg ttttagaaaa ttgcctgcaa gtgtcagggt gataagttaa 13020

agctttgtct tttgccctca gaggagctat cccatagtga gtagaagcca gagaagctga 13080 

ccccaggagt ccttctttcc agcagcaggt cttgagctgc acttctctgt agctacaatc 13140 

caggcaggaa caagccctag gtacctccgg agaggagggc aagagaggaa gaatgagttc 13200 

agctactcta gccaccaaac tgattatgaa ttgccctgaa atctgaaaaa tttcaattcc 13260 

aatcgtaagt ttgttttgtt tcattttgtt ttcttaaatt gtatatttga aagatggcat 13320 

taactaaaga tatatattca atatagagtg gaaaaaatgg aatacttgca tagtatcttt 13380 

tacttatagg tgatttatga tggggagtgg ggtggatagg ttggcagttc ccccaagaag 13440 

ttggaaatga agtttgtcct ctgtgagttg aactaattag atccacaagt aatgaaagca 13500 

gtattgtgtt gtagttaaga gcacactcta gaaccagatt gcttagtttc aaatcctggt 13560 

tctgcctttt attatctgtg tactttgggc aagttacttg ccctttgtgt gcttcatttt 13620 

tctcatctag aaaatggaga ggccaggcgt agtggctcat gcctataatc ccagcacttt 13680 

gggaggccga ggcgggcaga tcacctgagg tgagaagttc aagaccagcc tggccaacat 13740 

ggtgaaaccc tgtctctaca aaaatacaaa aattagccag gcatgatggc gggtgcctgt 13800 

aatcccagct acccaggagc ctgaggcggg agaaacactt gaacctggaa ggcagaggtt 13860 

gtagtgagcc aggattgcac cactgcactc cagcctgggt gacaagagct agactcagtc 13920 

taaaaaaaaa aaaaaaaaac aaactggaga tacaggctgg gtgcagggct tacacttata 13980 

atatcagcac tttgggaggc ctaggcggga ggattgcttg aactcaggag tttcaagatc 14040

agtctgggta acagagcaag acctcatccc cacaaaaaat caaaaattta gccaggcatg 14100 

gtggctcatg cctgtggtcc cagctactca ggaggctgag gcgagaggat tgcttgagcc 14160 

caggaggttg aggctgcagt gaaccatgac tgcaccacta catgccagcc tggatgacag 14220 

agcaagaccc tatctcaaaa aaaaaaaaaa aaagaaacga gccaggcgcg tttgctcacg 14280 

ccagtaatcc cagcactttg ggaggccaag gcaggtggat cacttgaggt caggagatcg 14340 

agactagcct ggccaacatg gtgaaacccc atctcaactg aaaatacaaa aattagccag 14400 

gcatggtggc atgctcctgt agtcccagct actcacttgg aggctgaggc acgagaatcg 14460 

cttgaaccca ggaggcggag gttgcagtgg gccaacatca tgtcactgca ctccagcctg 14520 

ggagacagag cgagactctg tctcaataaa taaataaaca taaaataaaa taaaataaaa 14580 

taaaataaaa taaaaaaata tggaggccag caggcacggt ggctcacgca tgtaatccca 14640 

gcactttggg aggccgaggg gggcggatca caaggtcagg agatcgagac catcctggct 14700

aacacagtga aaccgcgtct ctactaaaaa tacacaaaat tagccaggca tggtggcagg 14760 

cacctgtagt ccctgctact caggaggctg aggcaggaga atggcgtgaa cccgggaggc 14820 

ggagcttgca gtgagctgag atcgcgccac tgcagtccag cctgggcgac agagcaagac 14880 

tctgtctcaa aaaaaaaaaa aaaaatggag gttgggcgcg gtggctcgcg cctgtaatcc 14940
cagcactttg ggaggtcgag gcgggcggat cacctgaggt caggagttcc agaccagcct 15000

ggccaacatg gtgaaacctt gtctctacta aaattacaaa aattagccag gcacgatggc 15060 

aggcacctgt aatcccagct acttaggaga ctaaggcagg agaatagctt gaacctggga 15120 

gatggaggtt gcagtgtgct gagatcgcgc cactgccctc cagtagagtg agattccgtc 15180 

tcaaaaaaaa aaaaaaagaa gaaatggaga tacaaactta ctacctacct ccttacaacc 15240 

taccctcaca gtattactgt gaataaaagt gtgtgtagca ctgggaacac tattcacaga 15300 

gcactcatga atgtttgttc tttgttatta gttactagag aggcaaatgt ctgccagggc 15360 

tgaataatat gtgtgaattg gtgattgtcg cacatatcta aagaagtagt tatttttttc 15420 

aattaaaact tagtttaaaa accaatataa ggccgagcgc agtggctcac acctgtaatc 15480 

ccagcacttt gggaggccga ggtgggcaga tcatttgagg tcaggagttc gagactagcc 15540 

tggccaacat ggtgaaaccc tgtctctgct aaaaaaaaaa aaaaagtaca aaaattagcc 15600 

aggcatgatg gcaggtccct gtaatcccag ctacttggga ggccgaggca ggagaattgc 15660 

ttgaacccag gaggtggagg ttgtagtgag ccgagtttgt gccactgcac ttcagcctgg 15720 

gtgacagagg gagacactgt ctcaaaaaaa aaaaaaaaaa accaaaacca atataataaa 15780 

taagtggcca gcaatgaaac agaaagtgaa aagttagtga agcaaaacta gtactgtatt 15840 

cagataaaga tgctgaatct agatttggtc accagaatag ggtcctttgt ggcaacctgg 15900 

gctagtttgg ctgactcacc actgccagga tgaaatttct ttcagtggct actcatttcc 15960 

ctttatttta agtccatgct cacagagcaa ccttctgatg cctaattcag cttcctggga 16020 

tacttaataa caggaagggt ctggaagtag tacctgtata ggggatatga gtgttctgat 16080 

tttaatagtc aattcataag tgtacagagg gtttgataaa tggttaggtc agaaccatca 16140 

cagaatgtct acacctcttt ggacattagg aaggtcaaaa acctgaaagg ccaaaagcta 16200 

ggcctagatt agggtcattc accaagaaaa catcagcctt gaagagttct ctgggtggtc 16260 

caccagtcaa ccttcctttg atcacacctc cttcctcgtt gcttctttaa gcattgacct 16320

gtaatgggta tggaattttt tgctcaccta actccttcct tttacagagg aagaagttga 16380 

agcccagaga gatttaatgg cttgcctaag atcacacgca gattttctgt taaccagggt 16440 

gatttttcag gtgttccctg ccagacgagg gcttttttcc ttgaattgcc tagagatttc 16500 

ttgagatatc cgaagcattt ttcccagtgc agcctggaga aggatgtccc tgtcaacaca 16560 

gcatttgtta ctcaatgtta gacattcaat tttctaatta gtatcatgga gcaacagtgg 16620 

atgattatct ataaggggtt gcaattccat gcttatgtgc ttacagccca tatagacaaa 16680 

tatcagctgt taaaatgaca aggcagtaga gatgtggccc caggacaaag gcatactctg 16740 

ctgttagtga acactagttg gccagcaaat ttcacatggg catatacacg gccaactgta 16800 

gactttaggc atttataccc attcagagag ccaaactggc aactaaagat cagcattctc 16860 

tttggcattt cagctttgcg ttctgttaaa aatcactgct tgcttaaata cctctgatag 16920

ctcttcactg cctgtaggca actctttagc ctagcagact tggtctttag tgctctgccc 16980 

ctactctctt ccaccattct ggcctcctgt ctaattgctg cccatatgtg ccatgcacta 17040 

gagcttacag acctgctcag cgttatatga gcataccata ctctttatgc ctcagtgcat 17100 

ttgcacatgt tgttccttca ggccagaatg cctgttactg cctggcaatc agcctattag 17160 

agtctgccaa taccatccca tcttctgtgg aggagccccc cgccaaatcc acccatacct 17220 

ctccccacca atcagagact tcttctctct ttgttattct cttcgttatt ctcttcatac 17280 

ctcagttata tccatttcag tatttgttta cacatctagc atcactctta gagtgtgaaa 17340
ttctccaagt gtggagccgt atctagtttg tctttgtatc ccagagctta gcaaagtgcc 17400
tagaatgtag tgggtgctca gagtgtttgc tgggtgaatg atgtatttgt tgaacgactc 17460
tttggacact tgaataaagt ccatccagta tgcaccatta ccatctcttc gctctacaat 17520
attcttttag gcaagagctt atcttttgag gtgataagat aagctcaaac ttatgtagac 17580
taagacctca gtctgtaaat gtcatcccta agtcttaaac catcaaaacc agggcctcaa 17640
ggaatggcat gccttctgca actgtagcaa cctgctgtgc ttattttgcc gtgtttttca 17700
tttttccccc aaaagctaga gtcccttctc ccatgggcag tgctggaagt gtgctaacaa 17760
attctttctc catactgctt acgattacaa aaaaaaccct cagcatctca tgccagactt 17820
gagttaaggt tgttttcttt tgtgtgtcag ctgtattctg gtcatgactt cctgatgatg 17880
ccctatagag attttgctga gatcagaggg tgctccactg ccatcagtag cactgactct 17940
tgcagaagca ccgtttctga agttggctaa tgtcatccct cacgtttgtt tgtttgaaat 18000
ttgttttagt tccagagata gcactttcat ggaatgacgc tatcttctag aatcactttt 18060
tttttttttt tgagttggag tctcgctgtg tcgccaggct ggagtgcagt ggcacaatct 18120
cagctcactg caatctccac cttccgggtt caagtgattc ccctgcctca gcctcccgag 18180
gagctgttac tacaggcgca cacccccact cctggctaat tttatgtgtt ttagtagaga 18240
cggggtttca ccgtgttggc caggatggtc tcgatctcct gactttgtga tctgcctgct 18300
tcagcctccc aaagtgctgg gattacaggt gtgagtcacc gcgcctggcc tagaatcacc 18360
tttttatacc ataacgtgag caccactgcc gcgtcaccaa ggaaagagag aggCugctac 18420
tgtggggtta caaatgggta agagtggcac caggaaggtg aaagtctcta cttagccaag 18480
gcttaacaaa atgtcaatca ccaaacattt atttattaag ctacgttcag gataagaaga 18540
tgaacaagct atctgtacat tcattttctc gtttgtaaca aggtaatgat agtgatctat 18600
cctgcctgcc tctgagggtt attgtgagaa taaaatgaaa tcaagtggaa aagcacttag 18660
gaaaaagaaa agcattggtt ttcaattgtt agtgtggatc agaaacactg gggcttgttt 18720
aaaatgcaga ttcttagccc cagtctcagc gattctgatt ctgtatatct gaagtgggac 18780
tcaggaatct tgattttcaa caagctgacc agagggtcca atgctgctat tcctttagtt 18840
acactttcag aaatattact gtaaatcaaa tggcaagaat aaaatagtta tttgaggcag 18900
ttttagtatg ttggacctgg agtccaaaga cttgggtcaa actccagctt tgtcagttcc 18960
tagacctgtg accttaaaca gcaaccttct ctgtgaacct tagttccctc aggaacggct 19020
ctggtcacct cctgctgtac tccattgatg actcaccaca taaggctccc tgggagtccc 19080
ccaaaccttt gctctcttaa ctccttttac agcicrtcctac atctcctgca ggtgctgtct 19140
tctcctcctt tttccaggcc ctgctctgac acagcattca ttctcctctg ggaagggttc 19200
cttcaatgtg tctccaagca catcacaccc aggaaggacc ctgtggccat atctgtctat 19260
caccagatca aactacgtga aggcaggcac taggtactgt cagtgcccag cataggcctg 19320
gcccatacca ggtgtccaca gatgcctagt aaagaaacct atgattcagg acccccatga 19380
tgagcaacta tagcactaga acagtgataa taactaatgt ttataatgca tcttcagttt 19440
acagagggct tttgtactca tcatctagtt tagttcctgc aacaacctct tgaggaatat 19500
agcacaagca ggacaaggga agcccagaga tgttaaataa tttatccaag tttatgctgc 19560
tgggaagggc agcactgaaa ttaaaagaaa agttttctga gctcaaatcc catgcccttt 19620
cctcaatgtg agctctagca aggtattcag gaatcctgcc tctacagttc agagcctcaa 19680
attgctgggt atgttgagtt cttgtatctg atttttctag atttcctgcc cacattctta 19740
ctgtctggat atcaggaaag agtttatcaa atgcctgtgg aaatccaaga taaggtctca 19800
tgatgagtaa cccagtgaaa acatgaagtc aagtctaact agtcactact atttcactac 19860
tgctgactcc tgatgatcag ctccttttct aagtgcttac tgtccactta ttccatcatc 19920
tgcctagaat ttatgtgaag gaatcaaagc aaaaggatca taaggcttcc tttttccagt 19980
atgtttttcc tcctttttga aaactgggcc agttagctat ctccattttt atttcatgaa 20040
tacatcccca gcgcctggta tatagtagat atggaacatt acactttgga gatattgcac 20100
ccattctcca gtttctccaa agttactaac aatggttcca tcactgtgcc aacatatttt 20160
cttttttcaa tatattggga aataattctc ccagtctgaa aatctgaaca catttcatgt 20220
gacttggtat cctcatatgt cttgggcttc caattctcca ttcctagttt caagttcatg 20280
aactgtaaaa caaaggatta gactaaatct ctaaagttct atccagatgc caaattcttt 20340
tctctttcca tgatacctaa gatagatgcc aaatattgtc ttttacctgg tgtttgtgaa 20400
catgacatca cattacagga gtagcagata ctaaactctc actctgtaaa acactgactg 20460
agttccatga gccagatact gaagtgagct tgttcacata tgttctcatt taatgctcat 20520
aaccctgtga agctgggaat tgctgggaca ttttatttat ttatttattg agacggagtc 20580
tggctctgtc acctaggctg gtgtgcaatg gcatgatctt ggctcaccgc aacctccgcc 20640
tcccgggttc aagcgattct cttgcctcag cctccgcagt agctgggatt acggggcaca 20700
caccaccaca tccagctaat tttgtatttt tagcagagat ggagtttctc catgttggcc 20760
aggttggtca cgaacacttg acctcaagtg atctgcctgc ctcagcctcc caaagtgctg 20820
ggattacagg catgagccac catgcctgcc cgggaccctt gttttagaag gatgactgct 20880
gctataatgt agaaagtgat ttggaagagg ggaggagtgg ggcacgaaag atggttagta 20940
gatgggggtg gtaatgctta cctttcagta tttggaggct tcggagtcct caaaaattct 21000
cttccttgat tggagtcctc ccagccaata gagggcttca cacaaacagt ttcttgggtt 21060
ttgaattgtt tgaccagagc tttcttccga caaaaggttg gggtgattca ttcacttacc 21120
acaccttgcc tgaacattca cttggggctg ccggttatga aggctattgt tctccagcct 21180
gtcacagacg ctttgaagac ctgtgcctca gctggttcta aggagtcagt ttgttcagct 21240
ccgtgccagg tttccaactt atgaaatgtg ctggagatta acacctctcc tgccatttta 21300
tccctactat aattgccagt caaaggattc ctgcagttgc ctctggcagc cataactgat 21360
gaatgttctg ccagctgctc tgaggaccta gaagagcagt tttctatcca ggaccagttt 21420
ccaagggtgg gagggtgaaa tatatcctcc agtgtgacat ttcatctccc agtgatgggt 21480
ggcttgggcc ctttgaagtt ggctctgagg aaccacacac ttgggtctga gcagccagca 21540
gcttatcaca tctggtgatc aatccttcaa aggttcctcc tgaagtctga atttttggag 21600
gtcaaatgga ttccacctgg gaggggcttc tgcttcaact caggacatgg ggagaaggct 21660
gttcctcttc cagggggagg cagttttcat ggcattgaga tgtcctctca cttattcccc 21720
acccacccac caagtccttt gtaagaggag tagggggaga ggagagcgcc tgcagcctcc 21780
tgctcacatt cctagacacc gactcactga gcccgtcgcc gctggaacag cagagctgtg 21840
tgaaatgtca agaggagtta tgctcatagg ctccctggcc tcagtctctt tgtggcttgc 21900
atattcttcc attagtactg tgttcatcac atggaaatca gagggtacaa ttaaaagata 21960
atttgctagt cccagactta atttggggcc cccttcttgc ctgattgaat tacaggggaa 22020
cataatagat ttttggtgag aaatagttgt ctgtgtggct gggagaaaga ttgctcccag 22080
ctctccagct gggcagccct ttcagtatcc cgtatgttat ttccccactt ccagcccacc 22140
tcacctcctc tgtggccctt gtgtgtcccc tcggctagga tcctgacctc ctgctcaaga 22200 
gtttaaactc aacttgagac ccaaggaaaa tagagagccc tctgcaacct cataggggtg 22260 
aaaaatgttg atgctgggag ctatttagag acctaaccaa ggcccagaca gagagagtga 22320 
cttgctaaag gccacatagc tagcccacag tagttgtaac aatagtctta atgatattaa 22380 
tggctaacat ttatcaacct ttaatgtgtc ccagactttg tgccaagggc ttacatgcag 22440 
tgcattgtcg cattcaaacc cagacagtct ggctctgggc ccaggctgag ctttggtata 22500 
gcatggtaga acgttgtcta taatgtctag tctgggttca aatcctggct tcacttctca 22560 
catttacagc tgagtgacct caggcaagtg atttaacctc cctgtacctc agttgcttta 22620 
tctgtaaaga gaaaaatcac agcactgtgg aatagtgggg gttaaaattc attcatacaa 22680 
gtagtgctgc aagcaatgtt taatacaggg tgagcacctg ttcagtgctt ccttcttctg 22740 
gctgcctctg gggctagagt gtggtgtctt cgtggtatag atagatagat atggctgagc 22800 
tctgcacaaa caccaagagc tcttcttcac tattagaggt agtaaacaga gtggttgagc 22860
tctgtggttc tagaacagag gccggcaagc tatggcccat tgcctatttt aatacggcct 22920 
gtgattgatt gatttttttt ttctttttga gacagagttt cactcttgtt gcccaggctg 22980 
gaatgcaatg gcacgaactc agctcaccgc aacctctgcc tcctgggttc aagcgattct 23040 
cctgtctcag cctctcgagt agctgggatt acaggcatgt gccaccacgc ctggctaatt 23100 
tttgtatttt tagtagagac agggtttctc catgttggtc aggctagtct cgaacttcca 23160 
acctcaggtg atctgcccgc ctcagccttc caaagtgctg ggattacagg cgtgagccac 23220 
catgactggc ctgattgact gattttttta gtagagatag ggtcttggtt tgttacccag 23280 
gctggtctca aacttctggc ttcaagcagt cctccctcct tggcctctcg aatgctggga 23340 
ttataggcat gagccactat gcctggccta tatgacctgt gatttttaat ggttagggga 23400 
aaaaaagcaa aagaatgctt tgtgacatgt ggaaattaca tgaaactcaa atatcagtgt 23460 
cccagcctgg gcaacaaagt gagaccctgt ctctacaaaa aataaaaaaa aataagccag 23520 
ggccgggcgc agtggctcac acctataatc tcagcacttt gggaggccga ggcaagtgga 23580 
tcacctgagg tcaggagttc aagaccagcc tgaccaatat ggtgaaaccc tgtctgtact 23640 
aaaaacacaa aaattagccg agcatggtgg catgcgcctg tagtcccagc tacttgggag 23700 
gctgagacaa gagaattgct tgaacctggg aggcggaggt tgcagtgagc caagatcgcg 23760 
acactacact gcagcctggg caacagagcg agactccgac acacgcacgc acgcacacac 23820 
acacacacac acacacacac acgctgggta tggtggccag cacgtgtggt cccaggatgc 23880 
actggaggct taggtaggag gatcacttga gcttaggtgg ttgagactac aatgaaccat 23940 
gtttatacca ctgcacttta gccagggcaa cagtgtgaga ctgaatctca aaagaaaaaa 24000 
aaaaaaaaga aaaaaatctt tccataagta aatatctgtt ggaacatagc catgtccctt 24060 
agtttatgtt ttatatatgg ctgcttttgc cctataatga cacaattgag tggccacgac 24120 
agtctgtatg gcctgcagag cctaagatat ttgctctctg gccctttaca gaaaaagtgc 24180 
cttgacctgt gctctagagc catatgtacc aggtttgaaa ctcagcctca cagctgggtg 24240 
tgatggcacg catctgtagt cccagctact ctggaggctg aggtgagagg atcacttgag 24300 
tccagaaggt cgaggtcaag attgtagtga gccatgatgg catcaccgca ctccagcctg 24360 
agtgacagag agagaccctg actcaaaaaa aaaaaaacaa aaaaaaaaaa caccctcacc 24420
acttatcagc tatttgtctt gagaatagtg acataacccc tcagaaccta tttcctaatc 24480

tgttaaatga ggctgatgac gtttcctcct tttactggca atttaaacat gatggataat 24540

aaatgctaag cacttaacac agggcctaga agatattaac tgctcaataa atggtagctt 24600 

cttaacagta ttcaaaccca tgtgctctta tcacatgcat tgttgtccct gtgtccagtt 24660

ggtggaatgg gaaaaggctc ccttgtaacc ccatctacca tctttatcag actttcctgc 24720 

catggttcac agtaagagat agaagctgca cggtgacttc tggctcttta caatggtgag 24780

cggtgtgtgc ctggtaaggg agagctgatg tcactgcccc aaatccagta gtgagatctg 24840 

agtgttctgg tttcctccag cagccttgct ttttccttta caatcctgca ggcagggaga 24900 

caagggcttt ctacatggta ggctctggtt tggtcatcgt cacaactggg ggctgttcag 24960 

gtgggctccc attccagata cctaggctta tcaatccctt ttggcacccc aggccttttt 25020 

ctccctcatg ccccattttt cagtttgaaa agcatggtta tcacaggaca agtagaagaa 25080 

gctccactgt ccactgaggc caatggatgg tgttctgcat gtgaacactc agtgaatagt 25140 

gagtgaatga gagtaacctg ggctccatcc tatttgcaga gagctttgga aaagattttt 25200 

ctccttoaag agccagaatg aagcctggta gtgggagagc tccagctcta gagtcacatg 25260 

agcctacatt taaattccag ccctgccact gactcccttt ttgaccttga gtgagttacc 25320 

taatctctct gtacctcact tttcttgtct gtagagtggg aataattcct gtctcagaga 25380 

aataaaagag tgcatatagt gtttgccaca tggagacaca tcaggtgtag gttaatactc 25440 

tgggccttgt ttccttattt gcaacacagc cctgccctgg agtggaagtg gcacctccca 25500 

ttggtcagct cttgaggctg tccccaggac aggcagaggg agggaatgaa tgggagccct 25560 

agtgccagga cagaacagat ggcagctcag agctaggatg gctctctgga cctgtctctc 25620 

ctaccagagg tccccccgtc tggtgtggct cttcctggac ctggcatcct ctgctttttt 25680 

tttttttcca cctccaagca gaattactgt cctgtaggca gctcctctgc ttgaggacat 25740 

ctggggccag atatgttcac actctatcct gccttgccct tccctgagct caggatggac 25800 

gctcaattgg tcccagttat tgtctgcagc gcctgcctgc agcctcgatc cagcccagct 25860 

ccaccccttg cctgcaaggt ctgtttccta acagctgctc caaccacaca cctcggttct 25920 

gcgggagccc ctcctcttcc tccctccctc cctcattcag gggtgggact gaagaagaag 25980 

gctaacttga cagcagcgct tctttcttag ctagtcaccg gcccctgctc aagaatgcca 26040 

gtgtgtgtgt agcctccaca gagaggtcgt tttctcggag tccagagggg ccgcctgagc 26100 

ttctgagaac tagggaggag ccatcccagc catgagcccc tgtgggaatc tgctgggggc 26160 

caagtggcct ggagtcctca ggctcccgca gctgctccgg agggagaggt gagctcaggg 26220 

cagcctgcct gcagccagag gtgccgggag ccccgggcct gtcatggtgg ccatctacag 26280 

ccggcctgag gcagtcacag acggatttgc agctgagcct gtctatctgg tgtgggaaga 26340 

agatggggag ttacttgtca gtcccggctt acttcacctc cagagacctg tttcggtgag 26400 

ttggtctccg agttcccctc tccatctctc ctggcccctg gtcctgagag gagggtggtc 26460 

tccctaaatc tccttctcac ttagtccttt accatcggtt ctgccgggca gaagccagcg 26520 

gaggttatac ccaaggagaa tcggccttgt gaggtacccc cattatgtcc tggaagtggt 26580 

gaggggaggg atatacccag aaggaacttc ttagggagct ccagctcccc ttctatccca 26640 

gacaaacctg aaggagcctc caaaagatgc cactgacctg cccattgtag atgttactgc 26700 

ttccgggggg aatagcccaa atagagtgct gtttccagct ctcacatgtc ttacctgcgg 26760 

gccatgctgc ctgcccagga atttgtccca acaagcagga tgggcaggtt ttgccaaact 26820 



gtggaaactg gcaagtcctg gcftgtgggta gcctggtaca cagtaggcac cttataaacg 26880
tttgttctct taatggcagg cacatttgcc tctggccttg aagggcttct gagctcccag 26940
gtgaatgtag ttgctgggga aagacctggg cgagtgcttc taagactgga gcaatgggct 27000
ttagagtgtt cctgagctgc tgggccagcc cccacacctc ctcagtccct aggcctaagt 27060
acctccacga gcctctctct gtggggcttc tcagagggag atgtggaaac tctacctcta 27120
acctggcttt ctttgctcat tgccccactc cacctcccat agaaactccc cagggggttt 27180
ctggccctct gggtcccttc tgaatggagc cattccaggc tagggtgggg tttgttttca 27240
ttctttggga gcagcctgtt gttccaaaaa ggctgcctcc ccctcaccag tggtcctggt 27300
cgacttttcc cttctggctt ctctaagcta ggtccagtgc ccagatcttg ctgccgggat 27360
actagtcagg tggccaggcc ctgggcagaa aagcagtgta ccatgtggtt ttgtggaatg 27420
accggaccct ggtagattgc tgggaagtgt ctggacaggg ggaaggggga agggaactgg 27480
tcctcaatgc tgactctacc aagcgccctg ctagacactt tatcctttaa tctctcaaca 27540
gcctaaagag attatatatc cccattttac agatgaggca accagtttca acagagttaa 27600
catatggagc ctcactgggc agctttttct gtcttcctga ctttctctca tccttcaggg 27660
ggctgcaggt ttgttttctt ctcctagtgg agaggaaatt ctcaggtttg ttttcctctc 27720
ctagcagaga gtaaaaaaag ggatagtttg cctgacttgt tgaaggtgtg gctgagattg 27780
ttttctaaag agccaatgga aattgatctt gagtttagga gaaagctttt acatgtggaa 27840
ttaagatgcc aagtgttgaa gtagccacat ttcaggtcct cattaatttc tcttaatcct 27900
gggaaggcag cttaggagaa gggttgttcc tttaggagcc aggaactata ccccttttac 27960
ccttggagag gcagggaagc cagggaggac acaacttctc aggaagagga gaagctagag 28020
cagatagtga actctcaacc tgaaccttta agggccagac cactaatgcc acccaagtcc 28080
acctgccgtt tgtcttgttc tgtcccaggc tttctggaga acctgatctt cttgccccta 28140
cccccaagct ccgtttgccc agctagagtc tggggggtac tgactgactt tcgtagacat 28200
tcttcccttc cccaaataag aggccacatt cctgaagtca cttctgaaga gatagctgcc 28260
acacagggct ctttcccccc agggagggac cacccagacc ctctgctctc ccaggtatcc 28320
gttaccacat cactacctgg tcagaaagct gtttctgcca ttagcccctc cctcttttat 28380
tataggatat cctcaagggc tcctctttgg gcctcagttt catccttggc agaaagtaga 28440
agctagactt cttgggctcc tgaacagggt ccttgctgga ttctgtgaaa caaattaagt 28500
tcttgaccct aggcctctgg gggagtacaa agtctatggg agttctgggg ctgtggttgc 28560
aaggaaagtg acgcaaccag attccatggg gacatgatca ggcgtgacat gtgagggagg 28620
aagagggagc aagggaatga agaatacaac ttctgtgtcc catacacccc tgcctgacag 28680
gccatacata ctcagcagag aatgcactgt ctttcctacc acactagcgt gaggagtgag 28740
ctgcaattac cactgtgctt ccaagtaaga aaatacctca aattggaatt tacaaaagag 28800
gtaaattagg gagtggcttt tgtcggacat ctttaaagca tttttctttt tatagaattt 28860
cacttaatgt ccaatactga tttaatgagc ttgggtttac acottatctc ttgaagaaaa 28920
caaatgaacc tttgtgttcc aaagcaatcc atgtttaaag ggaaaaaatt atgcataact 28980
ctgcccagct tcacagtaac ctttggcagg tgccttaggt cctctgggac tcttttcctt 29040
atctgaaaaa tgaaggactt ggatcaggtg aatggttccc agctctgcaa cttatgtggc 29100
tcctcagagg cacacaagct cttttccatt atttgccaaa taatggaggc cctgtcttta 29160
actgcagtac aactacacaa aatacttgaa actacagtct tcctggtttt tggttggaac 29220
tgaatcagtg cactctagca acacttattt cttgctgttc gtaggcttca ttatgtgttt 29280
ggttaatttt ttaaaacaac aataacatat tccataataa ttacagctta attggcagac 29340
tgtttcagtc tataggatct gcaggaagga ggagtaataa agggattttt gactgagctc 29400
ttatggaaca gagtctctct aggcccctgt catatctgcc cttctgggcc ctggggaaaa 29460
gttggcatcc ccagttgtgg tgctctccag gtgccctcag gctgtggtgg agggagcttc 29520
ccattctctc cttcagccca ctcaattcag aggctagggg ctgaaagaag cttctctaca 29580
actggctgtt cactgggagg ttaagggatg accatccagc caggccttcc tcaggacatg 29640
ggagggctta tgctttaaca tgtgtaaatc cactgcaata atgactggtt cttttacccc 29700
ataaggttga gaatttacct gtaaacattt ttgtctgaag aatttggatg taagtgaggg 29760
ctgggcctct atcttatctc acttggcttc tctcagcaca gcaccttgcc tgcttgttct 29820
tacacatcct agatgcacag taactatttc ctaattatta gaaatctatt agaatcaatt 29880
gatttcagct gggcttggtg gctccttcct gtaatcccag cactttggga ggctaaggct 29940
ggaggatcac ctgagtccag gagtttaaga ccagcctggg caacataggg agaccctgtc 30000
tctacaaaaa ataaaaaatt agccaggcat ggtggtgtgc acctgtagtc ccagctactc 30060
aggaggctga ggcaggagga tctcttgagc ctgggaggtc agactacagt gagcaatgat 30120
tgtgccactg cactccagcc tgggtgacag agtaagactc tgtctcttaa aaaaaaaaaa 30180
aaaaaagttg atttctattt ggatagataa ataattcatt ttaggacctt tctttttcac 30240
ttacagaaat ctgtttcatt ctgggctgag aagcaggtcc atattgctag gcataggaga 30300
aaaaggggtc tgtctgcatt tgcccttggt ggtctcaaat tggggaggga aagaaatgaa 30360
cacttactgg ctaccttctg tgagccaggc atcatgcaag acatctgtac ataatttaat 30420
tctcataacc ccataagata ttattagcaa tgtacaagtg aggaaactga ggctcagagt 30480
catgaagtaa ctggccttgg gtgacacaga tggtaaatgg cagagaagga atatggatcc 30540
aggtcttgaa agagaaaatc tcaactgatt atctttttta aaaaactcat atgttctctg 30600
ctgactcaaa aggtctctgt gtggatctgg gttgacccac tgaactgacc atcagggttc 30660
catgcacttt gtatctgccc aagccctcag aacccctcag taatgttttg gaagatgagt 30720
tttggaggtt gtccttaggc atagcctcag cgtatgtagg cctctaggtg atctccccta 30780
acctgaggat ttcagctcaa ttcactctgg ctcctcagga cagtgggatg actggttcag 30840
acctcagctt taccacctcc cagctgggta ctcttctacc tacagccagg gcagattttg 30900
actttcactt gaaacttcca aaaattgaaa ggtagaaaaa cagccttggc tttgggaaga 30960
acgtatgatg tccatggcct ctaagcatct gaggtgggac atgttcgagt agcaccttac 31020
agttccaaag tgtgttctgg gttctttgtt taaaagaaca gagactgctg gggaattgaa 31080
cactgtgaag tatatgaagg aggagaattg tgctatttaa cattcagtac ttgggctaaa 31140
ggagaagcat cacgaagtgt taacactcaa agggtcttga gctgtcaggg ctccagcttc 31200
cttattttca caggtgagaa tcctgaggct cagctgttga gatgtgctgt ctcactccgg 31260
tgacatagta cagtggatgt ggctttgcag ccaagcacac atagcttcac attccagctc 31320
catcaattat gtattgggca gctttgcaga atgatttgac tttaactctg cttttcagtc 31380
ttctgtaaaa cagggataat cctgctaccg tagggttgtc aggattagag ataatataaa 31440
taaggtacct catataggac ctggattatg gctggcattc aataaatagt agctgttaat 31500
tgatagctaa gctagaactc tgaagtctac catggcaact tcttaagtgg tctgagaacc 31560
cagttgtgtt ctgtggcaaa acacagctta gggatccata cccagccctc ctgtcagctg 31620
ttcaccttcc agttcttcag agacatgtgt ggcagtgact ttggccacat agctggctgt 31680 
gccctttaaa ggcattcctt gacacagata tgtggactgg tgacgttgct ctccagccag 31740 
gtgttcttcc cagcaggctg gcctggctgt ctcctgcatg cctgtacttg tttgtctccc 31800 
tgctccctct cctgggcctg gccagagcta cttgcagcaa acaaaagcag gatattggca 31860 
atggaaagga gggtgtgttc tggtgctccc atgccctgcg gcgcacatac cattgcaagg 31920 
gcgtaacaga gcccaggcct gcatttgggt gcaaataagt ctgcacacag aagaaaagaa 31980 
ggacctggtg accaggagcc atggaaccct tgtgctcccc tacctgggct actggttctt 32040 
gccactccta ccattttcag tttggaaata tttgttaagg ctttgctctt ccaggtcctt 32100
tgcttggtgc tgagtctacc aagagtaagt gggatgctgt ttttgtcctc agggagctaa 32160 
cagtctagtg aagaagaaag atggttgccc aggaacttct aagtcagaag gcaggaggca 32220 
agaaggaagc ccctgctcct actgccagcc ctctgttggg caccccatag ttcttcagaa 32280 
ccacatttaa tcctcactgc aggccaggca tagtggctca cacctgtaat cgcagcactt 32340 
cgggaggcca aggcgggcag atcacttgag gtcgggagtt cgagaccagc ctcaccaaca 32400 
tggggaaacc ccgtctctac taaaaataga aaaattagcc gggtgtggtg gcatgcgcca 32460 
gtaatcccag ctactcagga ggctgaggtg ggaaaatcac ttgaactcgg gaagcagagg 32520 
ttgcagtgag ccgagattgt gccactgcac tccagcctgg gcgataagag caaaattcca 32580 
tctcaaaaaa aaaaagaaaa aagaaaaaat cctcactgct accttgaaag taggtgatga 32640 
cattgccatt tcacaaatga gaagtgaagg ggctagccca agatcactta ggtggtaaat 32700
ggtggtgcta agattagaac ctcagatcat ctagggaaaa acacagatat gcacagagtt 32760 
aaggggaccc agggtattgt ttgtcctctt gtttcacagg tggggaaaca acccagagag 32820 
ggaaaggggc ttgtccaagg caatttagca cccaagaact tgaacccata tctctctcct 32880 
cctcatttag agctcatccc acatgtatct tatattgaga ggagtgtgag ccacatacca 32940 
agaacagtct tcccctctgc ctccaacctc actgtgcagt tttgagacac ttcacagcca 33000 
tactcttcat gccataccca gcccttaaga ccctgaagtt ccccttccat aagacaagta 33060 
ggaaaagcta tagggtaaaa atagccatca gtgtttgttg agcacccagg aggaattggg 33120 
cactccagaa agataaaggg attctcaggg acttgcttct ctagacttcc ctagctcagc 33180 
tgcttcaact cattcctgcc cctcttctct acctcccgca gtgctcagaa gtagtagaac 33240 
tcactgtggc ctctcacctt gcattgttga gttttattta gactttctct tcctcaactc 33300 
ttcataagct catgaaaggt gaagtagggt gccctgtgta tttatctttt atatctgcag 33360 
tgcttagcaa gttataataa tgcacttgcc tggcaaaagg ctttctctca tacattagct 33420 
tatttcctct tcacattggc tctttgtagt aataggatgc tattagttat tttcaatgag 33480 
agaaagctac taagagaagt tgtccagcta gtgacagtaa gtggctgata aagtgagctg 33540 
ccattacatt gtcatcatct ttaatagaag ttaacacata ctgagtttct actatattgg 33600 
gtcttttttt tttttttttt ttttttttta gagacggaat cttgctctgt tgtccaggct 33660 
ggaacgcagt ggtgcaattt tgggtcacca caacctccgc ttcccaggtt caagcgattc 33720 
tcctgcctca gcctcctgag tagctgggac taccagtgca cgccaccacg cccggctaat 33780 
ttttgtattt ttagtagaga cagggtttca ccatgttggc caggctggtc ttgaactcct 33840 
gaccttgtga tctgcccgcc tcagcctccc aaagtgctgg gattacaggt gtgagccacc 33900
aggacctcca aggagagtgt tacaggctga gaggaggata tacttaggtt gtctttaggg 36360
aatcagaaaa ggagactctg gaataggctg gcagagagag gggctacctc ctatacctgc 36420 
tctggacaaa cgactttaag catagtgaca gatttgccaa ccctgtattg gaagaactga 36480 
tcttttttag tggggatgat tacttctggg gatttcttct cataactgag accaaaacag 36540 
ttttgtgcag tctcagaaat gacaggaggt accaatctga cacttccttt ggaagctcta 36600 
gggcagagag tgaaagagtg gattttgacg ggggccttgc ttggaggtca ttcacccacc 36660 
cctgtcctca ctccagcaac agtgataact cacttccttc ctccctttgt acacccttct 36720 
ccccacctgc tcacaggtgg ctggggagtt caagtaccac ccagagtgct ttgcctgtat 36780 
gagctgcaag gtgatcattg aggatgggga tgcatatgca ctggtgcagc atgccaccct 36840 
ctactggtaa gatagtggtc ctttgtctat cctctcccat ataagagtgg ctggcgggga 36900 
gggacagtgg cagggtgagt tgggcagaag gagtgttagg gtagtcagag cattggattc 36960 
ttaccacagc agtgctctta accagctctt taacttgtaa gcagaatgat ttacacatgt 37020 
ctctaccctt tttccttacc aaccttgaaa atgtcttcac tctgccctgc aatcctccca 37080 
gtgggaggca ctcttcaagg acgatcccag aacattaaag tcaaagaccc cttagagctc 37140 
accctgtcca accaccttgg ttgataaaag aagtcagcct ggggcccatg gaatagaata 37200 
gtacaagggc aaggttctca ttgtgagtca aaggtagagt gaagagaacc cagaccatct 37260 
caccccaacc caggccagtg tttttccaaa tataccactt gctgcagatc tagctcagca 37320 
cccccagtcc cagcccaccc tgagaaccca ggctcctcat tctgagcagc cagctagaat 37380 
catgacaaag agggtggtag tgagactatg ggtactgttg cttaaagcca catggtgcag 37440 
tggttgctgg ggggcttctg tgtgggactc tagcatctta ttcccccctg tgccctctcc 37500 
ccagtgggaa gtgccacaat gaggtggtgc tggcacccat gtttgagaga ctctccacag 37560 
agtctgttca ggagcagctg ccctactctg tcacgctcat ctccatgccg gccaccactg 37620 
aaggcaggcg gggcttctcc gtgtccgtgg agagtgcctg ctccaactac gccaccactg 37680 
tgcaagtgaa agagtaagta ttttgagaac ccttcagcag gggttcttga gcagagtctg 37740 
taaatgggcc tcagagggct tagacctcca aagtctcatg cagaactccc tttattctca 37800 
tctcatatct ttctcctgga ccccactatg ctgtaaccgt acctgggcct tggcacttac 37860 
tgttctctct gcccaggcta cttcctaccc gatacttaag gcaagaatca ctcacctttc 37920 
aggtgtcagg tttcaggtca tgtttgctct ttgaaatcat ctggcttgat tatgtgtatt 37980
agttgtttat cttctatccc ctccactaga atgtaaattc cagaagaaac ttgctgtctt 38040 
attcagtgct gcatgcccag ggcttggaag agtacctggc atatagtagg agttgattga 38100 
ttattatttt gtcagtcgag agaatgaatg gagaaaatgt ggtccatggc ccaaaagaag 38160 
ttaagaccct atcctagatt caggccagag accagatgga gaaagagtct gtgtctatct 38220 
aataccagta atgtcgtacc tctggccgct taccatgtaa atattgattg tgtatctacc 38280 
atgtgttgga cactaggcta gtgcttgcac agcaggtgaa agatactaga gtttgggaag 38340 
tcaggaggag ctaaggtctg ttctacaacc ttattagatg aagaggagag ggaattgtgt 38400 
tcagggcaga gggagaagca tttctccaaa agtaggagtc ttaatcatgt ctgatgtagg 38460 
ttgagtgtgg ccagaaaagg ggctgttaag tatagagggc ctggattatg aaaatccagc 38520
agatccattg agagtttaag cagcaaggtg ttgtgaccaa gttaacattt tagaaggatc 38580
actggtatgg aggttggatt ggagagggga aagcctaaag gtatagagac tagttaggaa 38640
gctattgtag gctgggcatg gtggttcatg cctgtaatct cagcactttg ggaggctgag 38700
gtgggaggat tgcttgaggc caggagttga agaccaacct ggccaacata gcaagacccc 38760 
gtctctgttt ttcttaatta aaagaaaagt ccagacgtag acatagtggc tcacgcctgt 38820 
aatgccagca ctttgggagg ccaaggtggg cagattgctt gaggtcaaga gtttgggatt 38880 
aggccaggcg cagtggctca cgcctgtaat cccagcactt tgggaggccg aggtgggcgg 38940 
atcacaaggt caggagatca agaccatcct ggctaacaca atgaaacccc gtctctacta 39000 
aaagtacaaa aattagccgg gcatggtggc ggacgcctgt agtcccagct actcgggagg 39060 
ctgaggcagg agaatggcgt gaacctagga ggcggagctt gctgtgagca gagatcacgc 39120 
cactgcactc cagcctgagc gacagagcga gactccatct caaaaaaaaa aaagagtttg 39180 
ggattagcct ggccaacatg gcaaaacccc atctctacaa aaagtacaaa aaaattagct 39240 
gggtatggtg gtgcgcgcct gtaatcccag ttactcagga ggctgaggca tgagaattgc 39300 
ttgagcctgg gaggtggagg ttgcagtgag cccagatcat gccactgcac tccagcctgg 39360 
atgacagagt aagatgccat ctcaaataaa aattaaaaac aaagtttaaa aaaaaaatag 39420 
aagctattac cgtgatccag gtaagagatg tgaataacta caatgatgga aagaaggcag 39480
agttcttaga gatgggagta ggagagatga gggaactcca gattgggaag atgatgttca 39540 
agtttctggc ttaggccaca gggtgagtgg caattccctt cactgagatg gggcatcctg 39600 
gaaaaggtgt tgcctttctg tgtgggtatc ctgggcccct taggggccac tggtggcctg 39660 
ggacctggta aaccttccct gcacaagcag aattggtcaa gcaggttttt aggacatctt 39720 
taccctgcct caactcttgt ctggcccagg gtcaaccgga tgcacatcag tcccaacaat 39780
cgaaacgcca tccaccctgg ggaccgcatc ctggagatca atgggacccc cgtccgcaca 39840
cttcgagtgg aggaggtaga gtgtgtgtct aatctgtctt gtgagggtgg gacatggaac 39900 
agatcctctg ggaaatcagg ctgtagcctt taccttttcc tacccccagc ccatctcttt 39960 
gtcttagcat tgagcctgtg accactggtg acctatttca gcgtaacagg ttcccagggt 40020 
agcagggatg gttgatggac gggagagctg acaggatgcc aggcagaggg cactgtgagg 40080 
ccactggcag ctaaaggcca ccattagaca agttgagcac tggccacact gtgcctgagt 40140 
catctgggtt ggccatgggt ggcctgggat ggggcagcct gtgggagctt tatactgctc 40200 
ttggccacag gtggaggatg caattagcca gacgagccag acacttcagc tgttgattga 40260 
acatgacccc gtctcccaac gcctggacca gctgcggctg gaggcccggc tcgctcctca 40320 
catgcagaat gccggacacc cccacgccct cagcaccctg gacaccaagg agaatctgga 40380 
ggggacactg aggagacgtt ccctaaggtg ccacctccca ccctggctct gttctgtcct 40440 
atgtctgtct ctcggatgaa gctgagctgg ctttcagaag cctgcagagt taggaaagga 40500 
accagctggc cagggacaga ctatgaggat tgtgctgacc cagctgcccc tgtggggatc 40560 
acagtttaca gccagagcct gtgcggaccc agctgtctgc caggtttcct tagaaacctg 40620 
agagtcagtc tctgtccact gaactcctaa gctggacagg aggcagtgat gctaaaccct 40680 
gaagggcaac atggcctatg gagaaagcat ggagctcaga gcctggagta cgggcacaga 40740
taggattgaa taaattgtgt agaaagactt tgaaaacaat aaagcaaaag atgaatgaac 40800 
gtttttttta gacttgaggg accaacaacc cccaaacccc agattctgcc aggtccatgg 40860 
ggaaggagaa gttgccttga gtggaagccc caagtaggga gacttacaga aaagaagtca 40920 
agagcactgg ctcccaggca gaaatactga taccctactg gggcttcagg ctgagctcct 40980 
cccttcacaa atcacttcat ctctctgagc ctgtttctgc atctgtgaca taagatggta 41040
agataaaggt ggctgtctca ccaattatgt aaggattaaa tgtggaaaag gacataaagt 41100
tgtatagtgc tgccataggg acagtgttca gtaaacgtga cacattctta gtatcactaa 41160 
gaatcaggtt cttggccagg caccgtggct catgcctgta atcccaacac tctgggaggc 41220 
ctaggtcgga ggatggcttg aacacaggag tttgagacca gcctgagcaa catagtgaga 41280 
cactgtctct acaaaaaaaa aataataata ataattgttt ttaattagat gggcagggca 41340 
ctgtggctca cacctgtaat cccagcactt tgggaggcca aggccggagg attgcttgag 41400 
gccaggagtt caggagcagc ctgggccaca ttcctgtctc tacaaagaat aaaaaagtta 41460 
actgggcatg gtggcacatg cctgtaatcc cagctactca agaggctgag gaggaggatt 41520 
gcctgagccc aggagttcaa gactgcagtg agccttgatc acaccactgt actacagctt 41580 
gggcaacaga gtgagacctt gtctccaaaa aaaaaagttt gttttttttt atccactctc 41640 
ctcaccaaac aaactgagta agttagagcc ctctcagctg gcatgtgttg gaaacagtgc 41700 
cctctcatta aagtgctgcc ctcactccca ttgcctcttg gccttggtca gtatgatgaa 41760 
attagtggga ggcagggcaa cagagggcag ggaagagcta gaaatccatg gcctggaaaa 41820 
gggaagattt gggagtggcc aggtatctgt agagccacca tgcagaggag gggggcagct 41880 
agccttgtgt gctctggtgg gcatggtcag caggaggcag agcaaaagga caagggtaag 41940 
taaacctgta ggtcgggaca agccaagagc catccagcgt cagtcctctc tgggtagccc 42000
aagtaaagca ggagcatacc ccagagagaa agttcgcagg gctgttcacc tgcagtgctg 42060
tggacttcaa ccttcttgtt ccttcttcag taagtgaaaa taacagtcat tgaccatgac 42120 
tattatcgac cgcttttgaa aatgtaaaca tagtgacttt attgctgtaa aaatcatacg 42180 
tgtttatcat cttaaaattc aggaaacatg gacaggtaca aagatgtgca aaatatcatc 42240 
caaaatccca tttgctggcc aggcacggtg gctcacgcct gtaatcccag cacattggga 42300 
ggccgaggcg ggcaaatcac ttgaggtcag gagtttgaga ccagcctggc caacatggtg 42360 
aaaccctatc tctactaaaa atacaataat taggctgggc gcagtggctc acgcctataa 42420 
tcccagcact ttgggaggcc gaggtgggcg aatcacaagg tcaggagttt gagactagcc 42480 
tggccaatat ggtgaaaccc catctctact aaaaatacaa aaattagggc cgggtgtggt 42540 
ggctcacgcc tgtaatccca gcacttaggg aggccgagac agatggatcg cgagatcagg 42600 
agttcgagac caacctagcc aacatggtga aaccccatct ctactaaaaa aatacaaaaa 42660 
ttattcggtt gtggtggcac acgcctgtaa tcccagctac ttgggaggct gaggcaggag 42720 
aatctcttga acctgggagg cagaggttgc agtgagtgga gatcccgccg ttgcactcca 42780 
gcctgggcga cagagtgaga ctccatcaaa aaaaaaaaaa aaaaaaaaaa aaattagccg 42840 
ggcg^ggtgg cgtgcaccta tactcccagc tacttgggag gctgaggcag gagaatcgct 42900 
tgaacctgga aggcggaggt cgcagtgagc cgagatcgtg ccattgcact tcagcctggg 42960 
cgacagagcg agactctgtc tcaaaaataa taataataac aataactagc cgggcctggt 43020 
ggcacatgcc tgtagtccca gttactcagg aggcggaggc atgagactca ggtgaactag 43080 
ggagacagag gttgcagtga gccaagatca caccactgca ctccagcctg gttgacagag 43140 
cgagactctg tctcaaaaaa aaaaaaatcc catttgctca ttttttggat actagtataa 43200 
ctatcactct aaaccagtta gtacttaaat caagcagata tgggagatgg tgaattacca 43260 
tctacagtgt tgtcatatat gtcacatact gagcattatc agctagtaga atctagttaa 43320 
ttgttctatg tgtgatgtat gcagagttcc cattttgaat gtgtttttac tatgcttaaa 43380
taaatgactg atgtcagcaa ccccaaaatg atacatctga tgtaagagcc cctgttcccc 43440
aataataaca tctaaactat agacattgga atgaacaggt gcccctaagt ttcctccctc 43500 
cagggtttct tggccggtct ctgaggacta cacatcccta ctcccgtctt tcctcatctt 43560 
caggcgcagt aacagtatct ccaagtcccc tggccccagc tccccaaagg agcccctgct 43620 
gttcagccgt gacatcagcc gctcagaatc ccttcgttgt tccagcagct attcacogca 43680 
gatcttccgg ccctgtgacc taatccatgg ggaggtcctg gggaagggct tctttgggca 43740 
ggctatcaag gtgagcgcag gcaacaattg ctttgctctt ctgcccccag tccctctgtc 43800 
actgtctttc ggggatttct catcacttgg ccccacccca caccatgcag gatgccaggc 43860 
ctccttcctg gctttgggtg ttggtgtgag aggtatcctt cacccccacc caggccacct 43920 
aaggtcaatg ttgctgttac agtgagcttg tggacctgga gatccaggtt gggttgagct 43980 
gtgcctgtgg ccctcctgcc tccagtcagt gggtgtttgt taggtgcctg cagacctcag 44040 
taccgggcat gctacaagga gcacacaggg gaatggctcc tgcctccctg gtgaacagtc 44100 
tcagggacta acctctctct ttctctcctc ctcctcctct tctgctgaga actgggaggg 44160 
ggggtcaggt aagacgtgtg tctcagcttg ggggcagcag ggctggagag ctcacccccg 44220 
atccacccag ctccctggtg catgtctttg gcactgacct tcctgccccc agacttctgt 44280 
tcactcagga gactcacttc tatgccaaat gaccagagcc cctgcttggc ttggcagcat 44340 
cccctcctgc cttcttcccc acttcccttt tctgggttct tgcctgtcct ctgtgcatgc 44400
ccagctctcc aggaaagagg gtttgcttcc gtgtgagtcc catgttgctc cacgctgcat 44460
cttccacaca tgaactctgt cattctgacc cggctcagtg tgccctccaa gggatgggat 44520 
ggccagctgc atagattttc tcaaacagtt ctccagaact tcctctggtc tcagcaccat 44580 
taacagtcac cctccctgta ggtgacacac aaagccacgg gcaaagtgat ggtcatgaaa 44640
gagttaattc gatgtgatga ggagacccag aaaacttttc tgactgaggt aagaagatgg 44700 
agggggcccg ggaggttggt gtcaccattg gaagagagaa gaccttacaa ataatggctt 44760
caagagaaaa tacagtttgg aattactgtc ttaaagacta agcagaaaag agccctagag 44820 
gaatatccca ctccctctaa attacagcgt aattatttgt tcaatgaaca cttactaaaa 44880
gcaacacaaa cagggtacaa gggatgcagt aacaaaagat acagggttca gaagagctct 44940 
caggttatga ggatgatgga catgaaaaca ctccaattta gtacaactca atgttataat 45000 
cctcacctga acgccctgct aagggagcct ggaggggagc tccctgagca ctcacactcc 45060 
ttgggcattt acagttttca ctacccctcc caagttactt catggagtaa cttaagttgg 45120 
ggacacctgt ggtctgggta ttgccctcca agccacttgg ccactcccac cccagttctc 45180 
ccaatgcagt tccaagggta aggcctatga agccatctcc atctatatgg tggtggtctt 45240 
ccctcatcct gatcttagtg ccctgtcata tcacaagata ggaggtagga gatacaggtg 45300 
gtaacacttg tcaagctgat tccttggagg gaagaggtaa ggaagacagt gagaagttaa 45360
ccaccagctt tccttggctt cccccacccc caggtgaaag tgatgcgcag cctggaccac 45420 
cccaatgtgc tcaagttcat tggtgtgctg tacaaggata agaagctgaa cctgctgaca 45480 
gagtacattg aggggggcac actgaaggac tttctgcgca gtatggtgag cacaccaccc 45540
catagtctcc aggagccttg gtgggttgtc agacacctat gctatcacta ccctaggagc 45600 
ttaaagggca gaggggccct gctttgcctc caaaggacca tgctgggtgg gactgagcat 45660 
acatagggag gcttcactgg gagaccacat tgacccatgg ggcctggacc acgagtggga 45720 
cagggctcaa cagcctctga aaatcattcc ccattctgca ggatccgttc ccctggcagc 45780
agaaggtcag gtttgccaaa ggaatcgcct ccggaatggt gagtcccacc aacaaacctg 45840
ccagcagggc gagagtaggg agaggtgtga gaattgtggg cttcactgga aggtagagac 45900
cccttcctat gcaacttgtg tgggctgggt cagcagctat tcattgagtt tgtctgtgtc 45960
actgaaactg accccagcca actgttctca gttcacagcc ctgttttcaa agaattacac 46020
atctctaaag gcaaacaggg cacggacaag gcaaactgga gaggcaaact gtagcctgag 46080
atggcctggg cttgccatca caggtattca ggtgctgagg gcccttagac caactagagc 46140
acctcactgc ctaggaaatc aatgaagggg aaatgagttc tagcggagcc ctgaaggatc 46200
agaattggat aaagttctta ttggcagaga ggcaccagga ttgaagtgac aggagcaaag 46260
acctgggagg aaagaggaga aaatcatcta tttcacctgg aaacaaatga ttccaagcat 46320
agaaataata acagctgaca agtactgagt gccctctata tgctaggcac tgggctgagg 46380
gattaacatg catgtgcatg tttattcctc atgacaacct tggtttccag ataagctgga 46440
ctggaaaggg acagagctgg gatcctgggc taatcagtct ggtcgccaag cctgagactt 46500
tagccactgc ccttcacatg ggggtccatg aaaatagtag tagtctggaa cagtttgggg 46560
gtacatcaag gtcgctgtgt tttaagctat ggagtctgga ctataggaga caaatgtaaa 46620
agagtttttt ggttgactgg ctttttggtt tttttgtttg tttgtttgtt tgtttgtttg 46680
tttgtttgtt ttttcctgtt tctggggctt gaatcaggaa ggaggttttt ttgttgttgt 46740
tgttttgaga aaggatattg ctctgttgcc cagactggag tgcagtggca cgatcatggc 46800
tcactacagc ttcgacctcc tgggctcaag caatcctcct gccttagcct cccaagtagc 46860
tggactacag gtgtgtacca ccacacctaa ttttttgaat ttttttttct tttttttttt 46920
tttttttttt ggtagagaca ggttctcact ttgttgccca ggcctgaatc tceaactcct 46980
gggctcaagc attcctcctg cctcgccctc ccaaagtgtt gggattacag ttgtgagcca 47040
ccatgcccgg caggaaaaga tttttaagca agaaagctta agagctgtgg tttttccaaa 47100
atgagtctgg gctggcacag tggctcatgc ctgtaatccc agcacttttt tgggaggccg 47160
aggtgagtgg atcacttgag gtcaggagtt tgagaccagc ctggccaact ggtgaaaccc 47220
ctgtttctac taaagaaaaa aatgcaaaaa ttagctgggc gtggtggtgc acgcctgtag 47280
tcccagctac tcaggaggcc gaggcaggag aatagcttga acctgggagg cagaagttgc 47340
agtgagccaa gatcacacca ctgcattcca gcctgggtga cagagtgaga cttcatctca 47400
aaaaaaaaaa aaaagagaga ctgatatggt tagtacattg gggtggaatg cggagggtcc 47460
agggaatgga gccctgcata gggggctaat gaaacatttc agatttctga attaaggtag 47520
tggctgtggg gacaggagcc tgggaggcag ggtggagtca gaatggagag actggttggc 47580
aatgagggaa caggaggagg aggaggagga gttacgagtg gcttgaggtg tcacttacca 47640
gacatttggg ggatggggga tagccgtgat tgttgagcaa ctggtttggg aagagctagc 47700
attgatccct gctgttctgt gctagcagaa cctatcagca tcttctgggc aggaaactgg 47760
ctccatgaga ctggcttagg gagaggctgc tagtcaccta atctgcagag aaggggcagc 47820
tggagctgtg ggacagaaga ggcatccatg tagctggtgg gggtgtctca gcttgtgaag 47880
aggagatggc tttgagcagg gctgacactg aaaaggctgg aagaaaaaaa cagacacaca 47940
agagtctcag gatcaggtag cataggaaag ttgtggacag tctttgagga gcactccctc 48000
aggcaggcag gcaggcaggt catgagctat agcgattcag gaagagctcc ctgggtgtgt 48060
gagcagctcc aggagcctaa gggatgaaag tagtattgca gggggctgga gagcaaggag 48120
tggctccttc tacatttgca agggaaggag aaaggaagtt gctcctgaga gtggtaagag 48180
tcagtggtgg aggcctggag aggagacata acaaacaaat ttgttgacaa acattttggt 48240 
aggaaggggg agagcttaaa gtttagacag tggggaaggt ggagtcttag aggaggtgaa 48300 
tgtctgaaag acagagctag ctggagcaag aagtcacttc tctgttgcag gcaggaagga 48360 
tccaaagtgg ctcaagccag agattgggag agtggggagg agggagcagc ctggatctaa 48420 
gtaaaatggg tagaggtgga gggggtgctg caacggccag ggttttctga agttggggac 48480 
attaggagag agctgtgagg gctttggcca gccactgtgc tagtgattgg tgaaccaaag 48540 
gatgggcagg agatggcagc agggaagcag aggaagtcca ggcttcctgt tggtattggg 48600
acaagggaga ggccatagga ggccctggcc ctgttgtcca ggttgggttc tgaagctggg 48660 
tgggcatggc ctggtaggag agcatctatg gcgcccaatt ccagattcag ggtctagttg 48720 
atttgctggc cctgtagcct cagctcatgc ttctgttcca ggcctatttg cactctatgt 48780 
gcatcatcca ccgggatctg aactcgcaca actgcctcat caagttggta tgtcccactg 48840 
ctctgggcct ggcctccagg gtcctatcct tcctggcttc cttgtcacaa aggaggctga 48900 
cttgtcccct ctggctagag ggcagaggtg ttgcctagga gctcctatct ttcccttcct 48960 
gcttcttcca atgcccttct ctgtcctctg ggagctccga gacacacaca gacataattt 49020 
caccttctct cattagcaac ctttgaaata atttgattag aagggacttc agaagtttgt 49080 
tgactatatg tagaaaaccc tgtcatttta cctgcttttg ccccatagta gtcttgtaaa 49140 
acagttcatt gctgacccca ttttacagtg gtggcacctg aagcctcagc ctgaggccac 49200 
cgagctagta aatttacagg gaccagtttg agaccagcat tcctcccact gcccctcagc 49260 
tgtggtggtt acaatgttgt ttgtcttact gacttgctat ctggcttcct gggtgtctac 49320 
cggctggccc tggctctgcc ctctagaccc acaccacgca atcttcattc ctttcccaca 49380 
tgactgccct gtagctattc aaagagcttg tctcccccaa gtctccccat ctactgcctc 49440 
caccttgcct ttttctgtct tatcctggtt ctagccactg cctgaaatca ttttaggaat 49500
aagacaggac agggaaaaac aaaagcaacc ccctgtccca cctctgagtt ccactctcca 49560 
agtccctgag cctcacctcc agggctccag tggctctgcc atgaacccac tgtgggctgg 49620 
gagtctgctg tgcacagata ccagaccctc agaaacacaa atgccaagtg tgtctgtttt 49680 
tttgttttgt tttgttttgt tttttagatg gagtctcatt ctgtttccca ggctggagtg 49740 
cagtggtgca atcttggctt actgcagcct ctacctcccg ggttctagtg attgttctgc 49800 
ttcagcctcc cagtagctag gactacaggc gtgtgccacc acgcccagct aatttttttt 49860 
tttttttttt tgtattttta gtagagacag ggttttgcca tgttggccag gctggtcttg 49920 
aactcctgac ctcaggtgat tcacccgcct tggcctccca aagttctggg attacaggtg 49980 
gaagccaccg tgcctggcct gagtgtgtct atttgataga gctttctgct ctgattctcc 50040 
cttgctatac accttttctc cccttctcag tggcttctct tgcctatgct tcctccccag 50100 
ggccaggttt gagaacatcc ccatgaagtc ctgacctgtc ttttatccta ccaggacaag 50160 
actgtggtgg tggcagactt tgggctgtca cggctcatag tggaagagag gaaaagggcc 50220
cccatggaga aggccaccac caagaaacgc accttgcgca agaacgaccg caagaagcgc 50280
tacacggtgg tgggaaaccc ctactggatg gcccctgaga tgctgaacgg tgagtcctga 50340
agccctggag gggacacccg cagagggagg acagatgctg cccttgcatc agagccctgg 50400
gaattccagg ggaggcctgt gaagcgtagg accggatacc cagagctgag gatatttttc 50460
ccttgccagg tggggcctca cgatttagct cctgagctca gggggctggg aactgatcag 50520
tgtcccatca tgggggataa ggtgagttct gactgtggca tttgtgcctc agggatcgct 50580
aagagctcag gctattgtcc cagctttagc cttctctctc catggtgaga actgaagtgt 50640
ggtgccctct ggtggataat gctcaaacca accagagatg ctggttggga ttcttgaaat 50700 
cagggttgtg aggcctcaga aatggtctga atacaatcca ttttggagtc tgaggcccag 50760 
agaagttcag tgaattgcct aggagcatac agctgcctaa tggcagaggc tagatgaacc 50820 
ctagtctggt tcttttccac tttaacgtgc agtttcatcc taggcagtgt tatgttataa 50880
gggctctcca aggcagttca cctacggctg aggaaggact attttcaggt ggtgtctgcg 50940 
caggacagcc tgtggggtgt ccctacagaa cctgttctag ccctagttct tagctgtggc 51000 
ttagattgac cctagaccca gtgcagagca ggtaagggat gtaaacttaa cagtgtgctc 51060 
tcctgtgttc cccaaggaaa gagctatgat gagacggtgg atatcttctc ctttgggatc 51120 
gttctctgtg aggtgagctc tggcaccaag gccatgcccg aggcagcagg cctagcagct 51180 
ctgccttccc tcggaactgg ggcatctcct cctagggatg actagcttga ctaaaatcaa 51240 
catgggtgta gggttttatg gtttataacg catctgcaca tctttgccac gttcgtgttt 51300 
cattggtctt aagagaagga ctggcagggt ttttttgttt tagatggagc ctcacttcgt 51360 
tgcccaggct ggagtgcagt ggcacaatct gggctcactg caacctctgc cttctgggtt 51420 
caagtgattc tcctgcctca gcctcccaag tagctgggac taccggcaca caccaccatg 51480 
cccggctaat ttttgtattt ttagtagaga cagggtttca ccatgttggc caggctggtc 51540 
ttgaactccg gacctcaggt gatccgcctg cctcagcctc taaaagtgct ggaattaata 51600
ggcgtgagct acctcgcccg gccaggtttt tttttttttt tttttagttg aggaaactga 51660
ggcttggaag agggcagtgg cttgcacatg gtcgataagg ggcagatgag actcagaatt 51720
ccagaaggaa gggcaagaga ctgttcatgt ggctgtctag ctagctcttg ggccaaatgt 51780 
agcccttctc agttcccttc aagtagaagt agccactcta ggaagtgtca gccctgtgcc 51840 
aggtaccacg tggacagagt gaggaatctt ggaaagattc ctacctttag gagtttagtc 51900 
aggtgacagc atatctcagc gactcaaaca cacacacatt caaagccttc tgtaattcct 51960 
acaaagttgt gaggggtaga ggagaggaga gacaagggat ggttaggata atgaaggaat 52020 
gttttgtttt tgtttttgtt tttgagatgg agtttcactc tgtcacccag gctggagtgc 52080 
agaggbgcaa tcttggctca ctgcagcctc cgcctcccag gttcaagcaa tcctcctgcc 52140 
tcagcctccc aagtagctgg gactacaggt gtgcgccacc acgcctggct aatttttgta 52200 
ttttcagtag agacagggtt tcgccatatt ggccaggctg gtctcaaatg cctgacctca 52260 
ggtgatacac ccgcttcagc ctcccaaagt gctgagatta caggcatgag ctaccgtgcc 52320 
tggccatgaa ggaagatttg ttttaaaaaa ttgttttctt taatattaat tgaacacctc 52380 
tgttcagagc actgggctgg tgccagaggg tttcagacat gaatcagatc cagcacctca 52440 
tagagcctta atctggcaca cacacacagc cacaaggaga cacagacaag gcagggtagg 52500 
atgagtggaa gctaggagca gatgctgatt tggaacactt ggcttctgca gtgaagcccc 52560 
ttcttagtcc tcttcagtaa cccagctctc agtggataca ggtctggatt agtaagattt 52620 
ggagagatga ttggggattg gggagagctc tctaacctat tttaccacct cctcttctgc 52680 
cattcttcct gtccacatcc ccagcatccc tttcccttgc caagtatctg tggcctctgt 52740 
agtcctttgt aaacagctgt cttcttaccc tacagatcat tgggcaggtg tatgcagatc 52800 
ctgactgcct tccccgaaca ctggactttg gcctcaacgt gaagcttttc tgggagaagt 52860
ttgttcccac agattgtccc ccggccttct tcccgctggc cgccatctgc tgcagactgg 52920
agcctgagag caggttggta tcctgccttt ttctcccagc tcacagggtc ctgggacgtt 52980 
tgcctctgtc taaggccacc cctgagccct ctgcaagcac aggggtgaga gaagccttga 53040 
ggtcaagaat gtggctgtca acccctgagc catctgacaa cacatatgta caggttggag 53100 
aagagagagg taaagacata gcagcaagta atctggatag gacacagaaa cacagccatt 53160 
aaaagaaagt ttaaaagaag gaaattcacc caaaccattt gaatacagta agtgtattca 53220 
tctttcgata ttcccctgtc catatctaca catatacttt tttttatagt aaatagttct 53280 
gtattttgcc ctgcatttcc cttgtgttta ctatccagtc ttcctgttta tcatttttgt 53340 
cgacaacatg aaattctatt gagagactgt ctgaacatat tgtaatgtag atgttcaggt 53400 
ttttccagtt tctctttaca ataggtattt aactacagtg agcagtttta tgcatttagc 53460 
taatttctcc tttgaggaag tattttcaaa attaccttta ttcttctcag gtaataattt 53520 
cattattacc aaagttaccc taggtctttt caagtgtgtg gttaaaaaac gagaatctgg 53580 
ctgggcgcga tggctcacac ctgtaatccc agcactttgg gaggctgagg ctggtggatc 53640 
acctgaggtc tggagttcga gaccagcctg gccaacatgg tgaaacccca tctctactaa 53700
aaatacaaaa cttagccagg catggtggca ggtgcctgta accccagcta cttgggaggc 53760 
tgaggcagga gaattgcttg aacccagggg cggaggttgc agtgagccga tatcacgcca 53820 
ttgcactcca gcctcggcaa caagagtgaa actctgtctc aaaaatgggg ttcttttcct 53880 
gccatcaaaa atcatgtttc ttttaaaaac aagttcaaac attaccaaag tttatagcac 53940 
aggaaatacg tcttctgtaa tctcccttaa ccaatatatc cctcaacatt ctcctcaccc 54000 
ccaactccac cctcccagga taaccagttg ggacataatc tttatttaaa aatggtttcc 54060 
ggatagagaa agcgcttcgg cggcggcagc cccggcggcg gccgcagggg acaaagggcg 54120 
ggcggatcgg cggggagggg gcggggcgcg accaggccag gcccgggggc tccgcatgct 54180 
gcagctgcct ctcgggcgcc cccgccgccg ccctcgccgc ggagccggcg agctaacctg 54240 
agccagccgg cgggcgtcac ggaggcggcg gcacaaggag gggccccacg cgcgcacgtg 54300
gccccggagg ccgccgtggc ggacagcggc accgcggggg gcgcggcgtt ggcggccccg 54360
gccccggccc ccaggccagg cagtggcggc caaggaccac gcatctactt tcagagcccc 54420
ccccggggcc gcaggagagg gcccgggctg ggcggatgat gagggcccag tgaggcgcca 54480
agggaaggtc accatcaagt atgaccccaa ggagctacgg aagcacctca acctagagga 54540
gtggatcctg gagcagctca cgcgcctcta cgactgccag gaagaggaga tctcagaac-b 54600
agagattgac gtggatgagc tcctggacat ggagagtgac gatgcctggg cttccagggt 54660 
caaggagctg ctggttgact gttacaaacc cacagaggcc ttcatctctg gcctgctgga 54720 
caagatccgg gccatgcaga agctgagcac accccagaag aagtgagggt ccccgaccca 54780 
ggcgaacggt ggctcccata ggacaatcgc taccccccga cctcgtagca acagcaatac 54840 
cgggggaccc tgcggccagg cctggttcca tgagcagggc tcctcgtgcc cctggcccag 54900
gggtctcttc ccctgccccc tcagttttcc acttttggat ttttttattg ttattaaact 54960
gatgggactt tgtgttttta tattgactct gcggcacggg ccctttaata aagcgaggta 55020 
gggtacgcct ttggtgcagc tcaaaaaaaa aaaaaaaaat gatttccagc ggtccacatt 55080 
agagttgaaa ttttctggtg ggagaatcta taccttgttc ctttataggc caaggaccgc 55140 
agtccttcag taacaccagt gtaaaagctt gaggagaaat tgtgaagcta cacagtattt 55200 
gttttctaat acctcttgtc attctaaata tctttaattt attaaaaaat atatatatac 55260
agtattgaat gcctactgtg tgctaggtac agttctaaac acttgggtta cagcagcgaa 55320
caaaataaag gtgcttaccc tcatagaaca tagattctag catggtatct actgtatcat 55380 
acagtagata caataagtaa actatattga atattagaat gtggcagatg ctatggaaaa 55440 
agagtcaaga caagtaaaga cgattgttca gggtaccagt tgcaatttta aatatggtcg 55500 
tcagagcagg cctcactgag gtgacatgac atttaagcat aaacatggag gaggaggagt 55560 
aagcctgagc tgtcttaggc ttccggggca gccaagccat ttccgtggca ctaggagcct 55620 
ggtgtttccg attccacctt tgataactgc attttctcta agatatggga gggaagtttt 55680 
tctcctattg tttttaagta ttaactccag ctagtccagc cttgttatag tgttacctaa 55740 
tctttatagc aaatatatga ggtaccggta acattatgcc catttctcac agaggcacta 55800 
ctaggtgaag gagtttgcct gacgttatac aaccaggaag tagctgagcc tagatccctt 55860 
ccacccaccc catggccctg ctcatgttcc acctgcctct aatttacctc ttttccttct 55920 
agaccagcat tctcgaaatt ggaggactcc tttgaggccc tctccctgta cctgggggag 55980 
ctgggcatcc cgctgcctgc agagctggag gagttggacc acactgtgag catgcagtac 56040 
ggcctgaccc gggactcacc tccctagccc tggcccagcc ccctgcaggg gggtgttcta 56100 
cagccagcat tgcccctctg tgccccattc ctgctgtgag cagggccgtc cgggcttcct 56160 
gtggattggc ggaatgttta gaagcagaac aagccattcc tattacctcc ccaggaggca 56220 
agtgggcgca gcaccaggga aatgtatctc cacaggttct ggggcctagt tactgtctgt 56280 
aaatccaata cttgcctgaa agctgtgaag aagaaaaaaa cccctggcct ttgggccagg 56340 
aggaatctgt tactcgaatc cacccaggaa ctccctggca gtggattgtg ggaggctctt 56400 
gcttacacta atcagcgtga cctggacctg ctgggcagga tcccagggtg aacctgcctg 56460 
tgaactctga agtcactagt ccagctgggt gcaggaggac ttcaagtgtg tggacgaaag 56520 
aaagactgat ggctcaaagg gtgtgaaaaa gtcagtgatg ctcccccttt ctactccaga 56580
tcctgtcctt cctggagcaa ggttgaggga gtaggttttg aagagtccct taatatgtgg 56640 
tggaacaggc caggagttag agaaagggct ggcttctgtt tacctgctca ctggctctag 56700 
ccagcccagg gaccacatca atgtgagagg aagcctccac ctcatgtttt caaacttaat 56760 
actggagact ggctgagaac ttacggacaa catcctttct gtctgaaaca aacagtcaca 56820 
agcacaggaa gaggctgggg gactagaaag aggccctgcc ctctagaaag ctcagatctt 56880 
ggcttctgtt actcatactc gggtgggctc cttagtcaga tgcctaaaac attttgccta 56940
aagctcgatg ggttctggag gacagtgtgg cttgtcacag gcctagagtc tgagggaggg 57000
gagtgggagt ctcagcaatc tcttggtctt ggcttcatgg caaccactgc tcacccttca 57060
acatgcctgg tttaggcagc agcttgggct gggaagaggt ggtggcagag tctcaaagct 57120 
gagatgctga gagagatagc tccctgagct gggccatctg acttctacct cccatgtttg 57180 
ctctcccaac tcattagctc ctgggcagca tcctcctgag ccacatgtgc aggtactgga 57240 
aaacctccat cttggctccc agagctctag gaactcttca tcacaactag atttgcctct 57300 
tctaagtgtc tatgagcttg caccatattt aataaattgg gaatgggttt ggggtattaa 57360 
tgcaatgtgt ggtggttgta ttggagcagg gggaattgat aaaggagagt ggttgctgtt 57420 
aatattatct tatctattgg gtggtatgtg aaatattgta catagacctg atgagttgtg 57480 
ggaccagatg tcatctctgg tcagagttta cttgctatat agactgtact tatgtgtgaa 57540 
gtttgcaagc ttgctttagg gctgagccct ggactcccag cagcagcaca gttcagcatt 57600
gtgtggctgg ttgtttcctg gctgtcccca gcaagtgtag gagtggtggg cctgaactgg 57660

gccattgatc agactaaata aattaagcag ttaacataac tggcaatatg gagagtgaaa 57720 

acatgattgg ctcagggaca taaatgtaga gggtctgcta gccaccttct ggcctagccc 57780 

acacaaactc cccatagcag agagttttca tgcacccaag tctaaaaccc tcaagcagac 57840 

acccatctgc tctagagaat atgtacatcc cacctgaggc agccccttcc ttgcagcagg 57900 

tgtgactgac tatgaccttt tcctggcctg gctctcacat gccagctgag tcattcctta 57960 

ggagccctac cctttcatcc tctctatatg aatacttcca tagcctgggt atcctggctt 58020 

gctttcctca gtgctgggtg ccacctttgc aatgggaaga aatgaatgca agtcacccca 58080 

ccccttgtgt ttccttacaa gtgcttgaga ggagaagacc agtttcttct tgcttctgca 58140 

tgtgggggat gtcgtagaag agtgaccatt gggaaggaca atgctatctg gttagtgggg 58200 

ccttgggcac aatataaatc tgtaaaccca aaggtgtttt ctcccaggca ctctcaaagc 58260 

ttgaagaatc caacttaagg acagaatatg gttcccgaaa aaaactgatg atctggagta 58320 

cgcattgctg gcagaaccac agagcaatgg ctgggcatgg gcagaggtca tctgggtgtt 58380 

cctgaggctg ataacctgtg gctgaaatcc cttgctaaaa gtccaggaga cactcctgtt 58440 

ggtatctttt cttctggagt catagtagtc accttgcagg gaacttcctc agcccagggc 58500 

tgctgcaggc agcccagtga cccttcctcc tctgcagtta ttcccccttt ggctgctgca 58560 

gcaccacccc cgtcacccac cacccaaccc ctgccgcact ccagccttta acaagggctg 58620 

tctagatatt cattttaact acctccacct tggaaacaat tgctgaaggg gagaggattt 58680

gcaatgacca accaccttgt tgggacgcct gcacacctgt ctttcctgct tcaacctgaa 58740

agattcctga tgatgataat ctggacacag aagccgggca cggtggctct agcctgtaat 58800 

ctcagcactt tgggaggcct cagcaggtgg atcacctgag atcaagagtt tgagaacagc 58860

ctgaccaaca tggtgaaacc ccgtctctac taaaaataca aaaattagcc aggtgtggtg 58920 

gcacatacct gtaatcccag ctactctgga ggctgaggca ggagaatcgc ttgaacccac 58980 

aaggcagagg ttgcagtgag gcgagatcat gccattgcac tccagcctgt gcaacaagag 59040 

ccaaactcca tctcaaaaaa aaaaa 59065 



<210> SEQ ID NO 4
<211> LENGTH: 265
<212> TYPE: PRT 
<213> ORGANISM: Human 

<400> SEQUENCE: 4 

Leu Thr Glu Val Lys Val Met Arg Ser Leu Asp His Pro Asn Val Leu
1 5 10 15

Lys Phe Ile Gly Val Leu Tyr Lys Asp Lys Lys Leu Asn Leu Leu Thr
20 25 30

Glu Tyr Ile Glu Gly Gly Thr Leu Lys Asp Phe Leu Arg Ser Met Asp
35 40 45

Pro Phe Pro Trp Gln Gln Lys Val Arg Phe Ala Lys Gly Ile Ala Ser
50 55 60

Gly Met Ala Tyr Leu His Ser Met Cys Ile Ile His Arg Asp Leu Asn
65 70 75 80

Ser His Asn Cys Leu Ile Lys Leu Asp Lys Thr Val Val Val Ala Asp
85 90 95

Phe Gly Leu Ser Arg Leu Ile Val Glu Glu Arg Lys Arg Ala Pro Met
100 105 110

Glu Lys Ala Thr Thr Lys Lys Arg Thr Leu Arg Lys Asn Asp Arg Lys
115 120 125

Lys Arg Tyr Thr Val Val Gly Asn Pro Tyr Trp Met Ala Pro Glu Met
130 135 140

Leu Asn Gly Lys Ser Tyr Asp Glu Thr Val Asp Ile Phe Ser Phe Gly
145 150 155 160

Ile Val Leu Cys Glu Ile Ile Gly Gln Val Tyr Ala Asp Pro Asp Cys
165 170 175

Leu Pro Arg Thr Leu Asp Phe Gly Leu Asn Val Lys Leu Phe Trp Glu
180 185 190

Lys Phe Val Pro Thr Asp Cys Pro Pro Ala Phe Phe Pro Leu Ala Ala
195 200 205

Ile Cys Cys Arg Leu Glu Pro Glu Ser Arg Pro Ala Phe Ser Lys Leu
210 215 220

Glu Asp Ser Phe Glu Ala Leu Ser Leu Tyr Leu Gly Glu Leu Gly Ile
225 230 235 240

Pro Leu Pro Ala Glu Leu Glu Glu Leu Asp His Thr Val Ser Met Gln
245 250 255

Tyr Gly Leu Thr Arg Asp Ser Pro Pro
260 265



That which is claimed is:

1. An isolated nucleic acid molecule consisting of a
nucleotide sequence selected from the group consisting of:

(a) a nucleotide sequence that encodes an amino acid
sequence shown in SEQ ID NO:2;

(b) a nucleic acid molecule consisting of the nucleic acid
sequence of SEQ ID NO:1;

(c) a nucleic acid molecule consisting of the nucleic acid
sequence of SEQ ID NO:3; and

(d) [illegible] sequence is fully complementary
to a nucleotide sequence of (a)-(c).

2. A nucleic acid vector comprising a nucleic acid molecule
of claim 1.

3. A host cell containing the vector of claim 2.

4. A process for producing a polypeptide comprising
culturing the host cell of claim 3 under conditions sufficient
for the production of said polypeptide, and recovering the
peptide from the host cell culture.

5. An isolated polynucleotide consisting of a nucleotide
sequence set forth in SEQ ID NO:1.

6. An isolated polynucleotide consisting of a nucleotide
sequence set forth in SEQ ID NO:3.

7. A vector according to claim 2, wherein said vector is
selected from the group consisting of a plasmid, virus, and
bacteriophage.

8. A vector according to claim 2, wherein said isolated
nucleic acid molecule is inserted into said vector in proper
orientation and correct reading frame such that the protein of
SEQ ID NO:2 may be expressed by a cell [illegible]
said vector.

9. A vector according to claim 8, wherein said isolated
nucleic acid molecule is operatively linked to a promoter
sequence.

* * * * *

GPR 25 POLYNUCLEOTIDES

FIELD OF THE INVENTION 

This invention relates to newly identified polypeptides 
and polynucleotides encoding such polypeptides, to their use 
in therapy and in identifying compounds which may be 
agonists, antagonists and/or inhibitors which are potentially 
useful in therapy, and to production of such polypeptides and 
polynucleotides. 

BACKGROUND OF THE INVENTION 

The drug discovery process is currently undergoing a
fundamental revolution as it embraces 'functional
genomics', that is, high throughput genome- or gene-based
biology. This approach is rapidly superseding earlier
approaches based on 'positional cloning'. A phenotype, that
is a biological function or genetic disease, would be
identified and this would then be tracked back to the responsible
gene, based on its genetic map position.

Functional genomics relies heavily on the various tools of
bioinformatics to identify gene sequences of potential inter- 
est from the many molecular biology databases now avail- 
able. There is a continuing need to identify and characterise 
further genes and their related polypeptides/proteins, as 
targets for drug discovery. 

It is well established that many medically significant 
biological processes are mediated by proteins participating 
in signal transduction pathways that involve G-proteins 
and/or second messengers, e.g., cAMP (Lefkowitz, Nature, 
1991, 351:353-354). Herein these proteins are referred to as 
proteins participating in pathways with G-proteins or PPG 
proteins. Some examples of these proteins include the GPC 
receptors, such as those for adrenergic agents and dopamine 
(Kobilka, B. K., et al., Proc. Natl. Acad. Sci. USA, 1987,
84:46-50; Kobilka, B. K., et al., Science, 1987,
238:650-656; Bunzow, J. R., et al., Nature, 1988,
336:783-787), G-proteins themselves, effector proteins,
e.g., phospholipase C, adenyl cyclase, and
phosphodiesterase, and actuator proteins, e.g., protein kinase
A and protein kinase C (Simon, M. I., et al., Science, 1991,
252:802-808).

For example, in one form of signal transduction, the effect 
of hormone binding is activation of the enzyme, adenylate 
cyclase, inside the cell. Enzyme activation by hormones is 
dependent on the presence of the nucleotide, GTP. GTP also 
influences hormone binding. A G-protein connects the hor- 
mone receptor to adenylate cyclase. G-protein was shown to 
exchange GTP for bound GDP when activated by a hormone 
receptor. The GTP-carrying form then binds to activated 
adenylate cyclase. Hydrolysis of GTP to GDP, catalyzed by 
the G-protein itself, returns the G-protein to its basal, 
inactive form. Thus, the G-protein serves a dual role, as an 
intermediate that relays the signal from receptor to effector, 
and as a clock that controls the duration of the signal. 

The membrane protein gene superfamily of G-protein
coupled receptors has been characterized as having seven
putative transmembrane domains. The domains are believed
to represent transmembrane α-helices connected by
extracellular or cytoplasmic loops. G-protein coupled receptors
include a wide range of biologically active receptors, such as 
hormone, viral, growth factor and neuroreceptors. 

G-protein coupled receptors (otherwise known as 7TM 
receptors) have been characterized as including these seven 
conserved hydrophobic stretches of about 20 to 30 amino 
acids, connecting at least eight divergent hydrophilic loops.
The G-protein family of coupled receptors includes dopamine
receptors which bind to neuroleptic drugs used for
treating psychotic and neurological disorders. Other
examples of members of this family include, but are not
limited to, calcitonin, adrenergic, endothelin, cAMP,
adenosine, muscarinic, acetylcholine, serotonin, histamine,
thrombin, kinin, follicle stimulating hormone, opsins,
endothelial differentiation gene-1, rhodopsins, odorant, and
cytomegalovirus receptors. Most G-protein coupled receptors
have single conserved cysteine residues in each of the
first two extracellular loops which form disulfide bonds that
are believed to stabilize functional protein structure. The 7
transmembrane regions are designated as TM1, TM2, TM3,
TM4, TM5, TM6, and TM7. TM3 has been implicated in
signal transduction.

Phosphorylation and lipidation (palmitylation or 
farnesylation) of cysteine residues can influence signal 
transduction of some G-protein coupled receptors. Most 
G-protein coupled receptors contain potential phosphoryla- 
tion sites within the third cytoplasmic loop and/or the 
carboxy terminus. For several G-protein coupled receptors, 
such as the β-adrenoreceptor, phosphorylation by protein
kinase A and/or specific receptor kinases mediates receptor 
desensitization. 

For some receptors, the ligand binding sites of G-protein 
coupled receptors are believed to comprise hydrophilic 
sockets formed by several G-protein coupled receptor trans- 
membrane domains, said sockets being surrounded by 
hydrophobic residues of the G-protein coupled receptors. 
The hydrophilic side of each G-protein coupled receptor 
transmembrane helix is postulated to face inward and form 
a polar ligand binding site. TM3 has been implicated in 
several G-protein coupled receptors as having a ligand 
binding site, such as the TM3 aspartate residue. TM5 
serines, a TM6 asparagine and TM6 or TM7 phenylalanines 
or tyrosines are also implicated in ligand binding. 

G-protein coupled receptors can be intracellularly 
coupled by heterotrimeric G-proteins to various intracellular 
enzymes, ion channels and transporters (see, Johnson et al., 
Endoc. Rev., 1989, 10:317-331). Different G-protein 
α-subunits preferentially stimulate particular effectors to
modulate various biological functions in a cell. Phosphory- 
lation of cytoplasmic residues of G-protein coupled recep- 
tors has been identified as an important mechanism for the 
regulation of G-protein coupling of some G-protein coupled 
receptors. G-protein coupled receptors are found in numer- 
ous sites within a mammalian host. Over the past 15 years, 
nearly 350 therapeutic agents targeting 7 transmembrane (7 
TM) receptors have been successfully introduced onto the 
market. 

SUMMARY OF THE INVENTION 

The present invention relates to GPR25, in particular
GPR25 polypeptides and GPR25 polynucleotides, recombinant
materials and methods for their production. In another
aspect, the invention relates to methods for using such
polypeptides and polynucleotides, including the treatment of
infections such as bacterial, fungal, protozoan and viral
infections, particularly infections caused by HIV-1 or HIV-2;
pain; cancers; diabetes; obesity; anorexia; bulimia; asthma;
Parkinson's disease; acute heart failure; hypotension;
hypertension; urinary retention; osteoporosis; angina pectoris;
myocardial infarction; stroke; ulcers; allergies;
benign prostatic hypertrophy; migraine; vomiting; psychotic
and neurological disorders, including anxiety,
schizophrenia, manic depression, depression, delirium,
dementia, and severe mental retardation; and dyskinesias,
such as Huntington's disease or Gilles de la Tourette's
syndrome, hereinafter referred to as "the Diseases", amongst
others. In a further aspect, the invention relates to methods
for identifying agonists and antagonists/inhibitors using the
materials provided by the invention, and treating conditions
associated with GPR25 imbalance with the identified
compounds. In a still further aspect, the invention relates to
diagnostic assays for detecting diseases associated with
inappropriate GPR25 activity or levels.

DESCRIPTION OF THE INVENTION 

In a first aspect, the present invention relates to GPR25
polypeptides. Such peptides include isolated polypeptides
comprising an amino acid sequence which has at least 70%
identity, preferably at least 80% identity, more preferably at
least 90% identity, yet more preferably at least 95% identity,
most preferably at least 97-99% identity, to that of SEQ ID
NO:2 over the entire length of SEQ ID NO:2. Such polypeptides
include those comprising the amino acid sequence of SEQ ID
NO:2.
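
By way of illustration only, the identity thresholds recited above can be checked with a short calculation. The following Python sketch is not part of the disclosure. It assumes that the two sequences have already been aligned by some external method (two rows of equal length, with '-' marking gaps), and the function names percent_identity and identity_tier are hypothetical. Identical positions are counted and divided by the full length of the reference sequence, which is one reasonable reading of identity "over the entire length" of a reference such as SEQ ID NO:2.

```python
def percent_identity(aligned_ref: str, aligned_query: str) -> float:
    """Percent identity of a query to a reference over the reference's full length.

    Both arguments are rows of one pairwise alignment: equal length, '-' for gaps.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned rows must have equal length")
    # Length of the reference itself, ignoring gap columns.
    ref_length = sum(1 for r in aligned_ref if r != "-")
    if ref_length == 0:
        raise ValueError("reference sequence is empty")
    # A column counts only when a reference residue is matched exactly.
    matches = sum(
        1 for r, q in zip(aligned_ref, aligned_query) if r != "-" and r == q
    )
    return 100.0 * matches / ref_length


def identity_tier(pct: float) -> str:
    """Map a percent identity onto the thresholds recited in the text."""
    for threshold in (97, 95, 90, 80, 70):
        if pct >= threshold:
            return f"at least {threshold}%"
    return "below 70%"


if __name__ == "__main__":
    # Toy 10-residue sequences, not taken from SEQ ID NO:2.
    pct = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
    print(pct, identity_tier(pct))
```

Because the denominator is the reference length, residues inserted in the query opposite a gap in the reference do not lower the score in this sketch; a stricter definition would divide by the alignment length instead.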

Further peptides of the present invention include isolated 
polypeptides in which the amino acid sequence has at least 
70% identity, preferably at least 80% identity, more prefer- 
ably at least 90% identity, yet more preferably at least 95% 
identity, most preferably at least 97-99% identity, to the 
amino acid sequence of SEQ ID NO:2 over the entire length 
of SEQ ID NO:2. Such polypeptides include the polypeptide 
of SEQ ID NO:2. 

Further peptides of the present invention include isolated 
polypeptides encoded by a polynucleotide comprising the 
sequence contained in SEQ ID NO:1.

Polypeptides of the present invention are believed to be
members of the G-protein coupled, 7 transmembrane receptor
gene family of polypeptides. They are therefore of
interest because this protein sequence, which was obtained
by PCR amplification of spleen cDNA, has six amino acid
differences from the published sequence for GPR25 having
GenBank Accession No. U91939, which published sequence
is believed to contain errors. These properties are hereinafter
referred to as "GPR25 activity" or "GPR25 polypeptide
activity" or "biological activity of GPR25". Also included
amongst these activities are antigenic and immunogenic
activities of said GPR25 polypeptides, in particular the
antigenic and immunogenic activities of the polypeptide of
SEQ ID NO:2. Preferably, a polypeptide of the present
invention exhibits at least one biological activity of GPR25.

The polypeptides of the present invention may be in the
form of the "mature" protein or may be a part of a larger
protein such as a fusion protein. It is often advantageous to
include an additional amino acid sequence which contains
secretory or leader sequences, pro-sequences, sequences
which aid in purification such as multiple histidine residues,
or an additional sequence for stability during recombinant
production.

The present invention also includes variants of the
aforementioned polypeptides, that is polypeptides that vary
from the referents by conservative amino acid substitutions,
whereby a residue is substituted by another with like
characteristics. Typical such substitutions are among Ala, Val,
Leu and Ile; among Ser and Thr; among the acidic residues
Asp and Glu; among Asn and Gln; among the basic
residues Lys and Arg; or among the aromatic residues Phe and Tyr.
Particularly preferred are variants in which several, 5-10,
1-5, 1-3, 1-2 or 1 amino acids are substituted, deleted, or
added in any combination.
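
As a non-limiting illustration, the substitution groups listed above can be encoded directly so that a proposed variant may be screened residue by residue. The Python sketch below is an assumption-laden example rather than a method of the disclosure: the names CONSERVATIVE_GROUPS, is_conservative and count_substitutions are hypothetical, residues are given as three-letter codes, and only substitutions (not deletions or additions) between sequences of equal length are considered.

```python
# Substitution groups as listed in the text (three-letter residue codes).
CONSERVATIVE_GROUPS = [
    {"Ala", "Val", "Leu", "Ile"},
    {"Ser", "Thr"},
    {"Asp", "Glu"},   # acidic
    {"Asn", "Gln"},
    {"Lys", "Arg"},   # basic
    {"Phe", "Tyr"},   # aromatic
]


def is_conservative(original: str, replacement: str) -> bool:
    """True if replacing `original` by `replacement` stays within one group."""
    if original == replacement:
        return False  # no substitution took place
    return any(
        original in group and replacement in group
        for group in CONSERVATIVE_GROUPS
    )


def count_substitutions(reference: list, variant: list) -> tuple:
    """Return (conservative, non_conservative) counts for equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    conservative = 0
    non_conservative = 0
    for ref_res, var_res in zip(reference, variant):
        if ref_res == var_res:
            continue
        if is_conservative(ref_res, var_res):
            conservative += 1
        else:
            non_conservative += 1
    return conservative, non_conservative
```

For instance, count_substitutions(["Ala", "Ser", "Asp"], ["Val", "Ser", "Lys"]) reports one conservative change (Ala to Val) and one non-conservative change (Asp to Lys).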



Polypeptides of the present invention can be prepared in 
any suitable manner. Such polypeptides include isolated 
naturally occurring polypeptides, recombinantly produced 
polypeptides, synthetically produced polypeptides, or 
polypeptides produced by a combination of these methods. 
Means for preparing such polypeptides are well understood 
in the art. 

In a further aspect, the present invention relates to GPR25 
polynucleotides. Such polynucleotides include isolated 
polynucleotides comprising a nucleotide sequence encoding 
a polypeptide which has at least 70% identity, preferably at 
least 80% identity, more preferably at least 90% identity, yet 
more preferably at least 95% identity, to the amino acid 
sequence of SEQ ID NO:2, over the entire length of SEQ ID 
NO:2. In this regard, polypeptides which have at least 97% 
identity are highly preferred, whilst those with at least 
98-99% identity are more highly preferred, and those with 
at least 99% identity are most highly preferred. Such
polynucleotides include a polynucleotide comprising the
nucleotide sequence contained in SEQ ID NO:1 encoding the
polypeptide of SEQ ID NO:2.

Further polynucleotides of the present invention include 
isolated polynucleotides comprising a nucleotide sequence 
that has at least 70% identity, preferably at least 80% 
identity, more preferably at least 90% identity, yet more 
preferably at least 95% identity, to a nucleotide sequence 
encoding a polypeptide of SEQ ID NO:2, over the entire 
coding region. In this regard, polynucleotides which have at 
least 97% identity are highly preferred, whilst those with at 
least 98-99% identity are more highly preferred, and those 
with at least 99% identity are most highly preferred. 

Further polynucleotides of the present invention include 
isolated polynucleotides comprising a nucleotide sequence 
which has at least 70% identity, preferably at least 80% 
identity, more preferably at least 90% identity, yet more 
preferably at least 95% identity, to SEQ ID NO:1 over the
entire length of SEQ ID NO:1. In this regard, polynucleotides
which have at least 97% identity are highly preferred,
whilst those with at least 98-99% identity are more highly
preferred, and those with at least 99% identity are most
highly preferred. Such polynucleotides include a polynucleotide
comprising the polynucleotide of SEQ ID NO:1 as
well as the polynucleotide of SEQ ID NO:1.

The invention also provides polynucleotides which are 
complementary to all the above described polynucleotides. 
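
For illustration, the complement of a polynucleotide is conventionally written as its reverse complement, so that both strands read 5' to 3'. The following Python sketch shows that operation under stated assumptions: the function name reverse_complement is hypothetical, and only the four unambiguous DNA bases are accepted.

```python
# Translation table pairing each base with its Watson-Crick partner.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    unexpected = set(sequence) - set("acgtACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    # Complement every base, then reverse the strand direction.
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("atgc"))
```

Applying the function twice returns the starting sequence, which is a convenient sanity check when preparing complementary probes.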

The nucleotide sequence of SEQ ID NO:1 shows homology
with GPR25 (Jung, B. P. et al., Biochem. Biophys. Res.
Commun. 230(1), 69-72 (1997)). The nucleotide sequence
of SEQ ID NO:1 is a cDNA sequence and comprises a
polypeptide encoding sequence (nucleotides 79 to 1161)
encoding a polypeptide of 361 amino acids, the polypeptide
of SEQ ID NO:2.
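
The stated coordinates can be checked arithmetically: nucleotides 79 to 1161 inclusive span 1161 - 79 + 1 = 1083 nucleotides, and 1083 / 3 = 361 codons, which agrees with a polypeptide of 361 amino acids and implies that the stated range does not include a stop codon. The Python sketch below performs the same check. The helper names codon_count and coding_slice are hypothetical, and coordinates are assumed to be 1-based and inclusive, as is conventional in sequence listings.

```python
def codon_count(start: int, end: int) -> int:
    """Number of complete codons in a 1-based, inclusive nucleotide range."""
    length = end - start + 1
    if length <= 0:
        raise ValueError("end must not precede start")
    if length % 3 != 0:
        raise ValueError("range length is not a multiple of three")
    return length // 3


def coding_slice(sequence: str, start: int, end: int) -> str:
    """Extract a 1-based, inclusive range from a sequence string."""
    if start < 1 or end > len(sequence):
        raise ValueError("range falls outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


if __name__ == "__main__":
    print(codon_count(79, 1161))
```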

The nucleotide sequence encoding the polypeptide of
SEQ ID NO:2 may be identical to the polypeptide encoding
sequence contained in SEQ ID NO:1 or it may be a sequence
other than the one contained in SEQ ID NO:1, which, as a
result of the redundancy (degeneracy) of the genetic code,
also encodes the polypeptide of SEQ ID NO:2. The polypeptide
of SEQ ID NO:2 is structurally related to other
proteins of the G-protein coupled, 7 transmembrane receptor
gene family, having homology and/or structural similarity
with GPR25 (Jung, B. P. et al., Biochem. Biophys. Res.
Commun. 230(1), 69-72 (1997)).

Preferred polypeptides and polynucleotides of the present
invention are expected to have, inter alia, similar biological
functions/properties to their homologous polypeptides and
polynucleotides. Furthermore, preferred polypeptides and
polynucleotides of the present invention have at least one
GPR25 activity.

Polynucleotides of the present invention may be obtained, 
using standard cloning and screening techniques, from a 
cDNA library derived from mRNA in cells of human spleen,
using the expressed sequence tag (EST) analysis (Adams,
M. D., et al., Science (1991) 252:1651-1656; Adams, M. D.
et al., Nature, (1992) 355:632-634; Adams, M. D., et al., 
Nature (1995) 377 Supp:3-174). Polynucleotides of the 
invention can also be obtained from natural sources such as 
genomic DNA libraries or can be synthesized using well 
known and commercially available techniques. 

When polynucleotides of the present invention are used 
for the recombinant production of polypeptides of the 
present invention, the polynucleotide may include the cod- 
ing sequence for the mature polypeptide, by itself; or the 
coding sequence for the mature polypeptide in reading frame 
with other coding sequences, such as those encoding a leader 
or secretory sequence, a pre-, or pro- or prepro- protein 
sequence, or other fusion peptide portions. For example, a 
marker sequence which facilitates purification of the fused 
polypeptide can be encoded. In certain preferred embodi- 
ments of this aspect of the invention, the marker sequence is 
a hexa-histidine peptide, as provided in the pQE vector 
(Qiagen, Inc.) and described in Gentz et al., Proc Natl Acad 
Sci USA (1989) 86:821-824, or is an HA tag. The polynucleotide
may also contain non-coding 5' and 3' sequences,
such as transcribed, non-translated sequences, splicing and 
polyadenylation signals, ribosome binding sites and 
sequences that stabilize mRNA. 

Further embodiments of the present invention include 
polynucleotides encoding polypeptide variants which com- 
prise the amino acid sequence of SEQ ID NO:2 and in which 
several, for instance from 5 to 10, 1 to 5, 1 to 3, 1 to 2 or 1, 
amino acid residues are substituted, deleted or added, in any 
combination. 

Polynucleotides which are identical or sufficiently identical
to a nucleotide sequence contained in SEQ ID NO:1,
may be used as hybridization probes for cDNA and genomic 
DNA or as primers for a nucleic acid amplification (PCR) 
reaction, to isolate full-length cDNAs and genomic clones 
encoding polypeptides of the present invention and to isolate 
cDNA and genomic clones of other genes (including genes 
encoding homologs and orthologs from species other than 
human) that have a high sequence similarity to SEQ ID 
NO:1. Typically these nucleotide sequences are 70%
identical, preferably 80% identical, more preferably 90% 
identical, most preferably 95% identical to that of the 
referent. The probes or primers will generally comprise at 
least 15 nucleotides, preferably, at least 30 nucleotides and 
may have at least 50 nucleotides. Particularly preferred 
probes will have between 30 and 50 nucleotides. 

A polynucleotide encoding a polypeptide of the present 
invention, including homologs and orthologs from species 
other than human, may be obtained by a process which 
comprises the steps of screening an appropriate library under 
stringent hybridization conditions with a labeled probe having
the sequence of SEQ ID NO:1 or a fragment thereof; and
isolating full-length cDNA and genomic clones containing 
said polynucleotide sequence. Such hybridization tech- 
niques are well known to the skilled artisan. Preferred 
stringent hybridization conditions include overnight incubation
at 42° C. in a solution comprising: 50% formamide,
5x SSC (150 mM NaCl, 15 mM trisodium citrate), 50 mM
sodium phosphate (pH 7.6), 5x Denhardt's solution, 10%
dextran sulfate, and 20 microgram/ml denatured, sheared
salmon sperm DNA; followed by washing the filters in 0.1x
SSC at about 65° C. Thus the present invention also includes
polynucleotides obtainable by screening an appropriate
library under stringent hybridization conditions with a
labeled probe having the sequence of SEQ ID NO:1 or a
fragment thereof.
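
To make the salt concentrations in the hybridization and wash steps concrete, the following Python sketch scales the SSC composition to the stated strengths. It assumes that the parenthetical composition (150 mM NaCl, 15 mM trisodium citrate) describes 1x SSC, which matches the standard definition of that buffer; the names SSC_1X_MM and ssc_concentrations are hypothetical.

```python
# Composition of 1x SSC in millimolar (assumed to be what the parenthetical gives).
SSC_1X_MM = {"NaCl": 150.0, "trisodium citrate": 15.0}


def ssc_concentrations(fold: float) -> dict:
    """Millimolar concentration of each solute in `fold`-times SSC."""
    if fold <= 0:
        raise ValueError("fold strength must be positive")
    return {solute: fold * mm for solute, mm in SSC_1X_MM.items()}


if __name__ == "__main__":
    print("hybridization, 5x SSC:", ssc_concentrations(5))
    print("wash, 0.1x SSC:", ssc_concentrations(0.1))
```

Under that assumption the hybridization buffer contains about 750 mM NaCl, whereas the 0.1x wash contains about 15 mM NaCl; the fifty-fold drop in salt, together with the higher wash temperature, is what makes the wash the stringent step.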

The skilled artisan will appreciate that, in many cases, an 
isolated cDNA sequence will be incomplete, in that the 
region coding for the polypeptide is cut short at the 5' end of 
the cDNA. This is a consequence of reverse transcriptase, an 
enzyme with inherently low 'processivity' (a measure of the
ability of the enzyme to remain attached to the template 
during the polymerisation reaction), failing to complete a 
DNA copy of the mRNA template during 1st strand cDNA 
synthesis. 

There are several methods available and well known to 
those skilled in the art to obtain full-length cDNAs, or 
extend short cDNAs, for example those based on the method 
of Rapid Amplification of cDNA ends (RACE) (see, for 
example, Frohman et al., PNAS USA 85, 8998-9002, 1988).
Recent modifications of the technique, exemplified by the 
Marathon™ technology (Clontech Laboratories Inc.) for 
example, have significantly simplified the search for longer 
cDNAs. In the Marathon™ technology, cDNAs have been 
prepared from mRNA extracted from a chosen tissue and an 
'adaptor' sequence ligated onto each end. Nucleic acid 
amplification (PCR) is then carried out to amplify the 
'missing' 5' end of the cDNA using a combination of gene
specific and adaptor specific oligonucleotide primers. The
PCR reaction is then repeated using 'nested' primers, that is,
primers designed to anneal within the amplified product
(typically an adaptor specific primer that anneals further 3'
in the adaptor sequence and a gene specific primer that
anneals further 5' in the known gene sequence). The products
of this reaction can then be analysed by DNA sequencing
and a full-length cDNA constructed either by joining the
product directly to the existing cDNA to give a complete 
sequence, or carrying out a separate full-length PCR using 
the new sequence information for the design of the 5' primer. 

Recombinant polypeptides of the present invention may 
be prepared by processes well known in the art from 
genetically engineered host cells comprising expression 
systems. Accordingly, in a further aspect, the present inven- 
tion relates to expression systems which comprise a poly- 
nucleotide or polynucleotides of the present invention, to 
host cells which are genetically engineered with such 
expression systems and to the production of polypeptides of
the invention by recombinant techniques. Cell-free transla- 
tion systems can also be employed to produce such proteins 
using RNAs derived from the DNA constructs of the present 
invention. 

For recombinant production, host cells can be genetically 
engineered to incorporate expression systems or portions 
thereof for polynucleotides of the present invention. Intro- 
duction of polynucleotides into host cells can be effected by 
methods described in many standard laboratory manuals, 
such as Davis et al., Basic Methods in Molecular Biology
(1986) and Sambrook et al., Molecular Cloning: A Labora- 
tory Manual, 2nd Ed., Cold Spring Harbor Laboratory Press, 
Cold Spring Harbor, N.Y. (1989). Preferred such methods 
include, for instance, calcium phosphate transfection, 
DEAE-dextran mediated transfection, transvection, 
microinjection, cationic lipid-mediated transfection, 
electroporation, transduction, scrape loading, ballistic intro- 
duction or infection. 

Representative examples of appropriate hosts include
bacterial cells, such as streptococci, staphylococci, E. coli,
Streptomyces and Bacillus subtilis cells; fungal cells, such as
yeast cells and Aspergillus cells; insect cells such as
Drosophila S2 and Spodoptera Sf9 cells; animal cells such as
CHO, COS, HeLa, C127, 3T3, BHK, HEK 293 and Bowes
melanoma cells; and plant cells.

A great variety of expression systems can be used, for
instance, chromosomal, episomal and virus-derived
systems, e.g., vectors derived from bacterial plasmids, from
bacteriophage, from transposons, from yeast episomes, from
insertion elements, from yeast chromosomal elements, from
viruses such as baculoviruses, papova viruses, such as
SV40, vaccinia viruses, adenoviruses, fowl pox viruses,
pseudorabies viruses and retroviruses, and vectors derived
from combinations thereof, such as those derived from
plasmid and bacteriophage genetic elements, such as
cosmids and phagemids. The expression systems may contain
control regions that regulate as well as engender expression.
Generally, any system or vector which is able to
maintain, propagate or express a polynucleotide to produce
a polypeptide in a host may be used. The appropriate
nucleotide sequence may be inserted into an expression
system by any of a variety of well-known and routine
techniques, such as, for example, those set forth in Sambrook
et al., MOLECULAR CLONING, A LABORATORY
MANUAL (supra). Appropriate secretion signals may be
incorporated into the desired polypeptide to allow secretion
of the translated protein into the lumen of the endoplasmic
reticulum, the periplasmic space or the extracellular
environment. These signals may be endogenous to the polypeptide
or they may be heterologous signals.

If a polypeptide of the present invention is to be expressed 
for use in screening assays, it is generally preferred that the 
polypeptide be produced at the surface of the cell. In this 
event, the cells may be harvested prior to use in the 
screening assay. If the polypeptide is secreted into the
medium, the medium can be recovered in order to recover 
and purify the polypeptide. If produced intracellularly, the 
cells must first be lysed before the polypeptide is recovered. 

Polypeptides of the present invention can be recovered
and purified from recombinant cell cultures by well-known 
methods including ammonium sulfate or ethanol 
precipitation, acid extraction, anion or cation exchange 
chromatography, phosphocellulose chromatography, hydro-
phobic interaction chromatography, affinity
chromatography, hydroxylapatite chromatography and lec-
tin chromatography. Most preferably, high performance liq-
uid chromatography is employed for purification. Well
known techniques for refolding proteins may be employed 
to regenerate active conformation when the polypeptide is
denatured during isolation and/or purification.

This invention also relates to the use of polynucleotides of 
the present invention as diagnostic reagents. Detection of a 
mutated form of the gene characterised by the polynucle-
otide of SEQ ID NO:1 which is associated with a dysfunc-
tion will provide a diagnostic tool that can add to, or define,
a diagnosis of a disease, or susceptibility to a disease, which 
results from under-expression, over-expression or altered 
expression of the gene. Individuals carrying mutations in the 
gene may be detected at the DNA level by a variety of
techniques. 

Nucleic acids for diagnosis may be obtained from a 
subject's cells, such as from blood, urine, saliva, tissue 
biopsy or autopsy material. The genomic DNA may be used 
directly for detection or may be amplified enzymatically by
using PCR or other amplification techniques prior to analy- 
sis. RNA or cDNA may also be used in similar fashion. 
Deletions and insertions can be detected by a change in size
of the amplified product in comparison to the normal geno- 
type. Point mutations can be identified by hybridizing ampli- 
fied DNA to labeled GPR25 nucleotide sequences. Perfectly 
matched sequences can be distinguished from mismatched 
duplexes by RNase digestion or by differences in melting 
temperatures. DNA sequence differences may also be 
detected by alterations in electrophoretic mobility of DNA 
fragments in gels, with or without denaturing agents, or by 
direct DNA sequencing (see, e.g., Myers et al., Science
(1985) 230:1242). Sequence changes at specific locations
may also be revealed by nuclease protection assays, such as 
RNase and S1 protection or the chemical cleavage method
(see Cotton et al., Proc Natl Acad Sci USA (1985) 
85:4397-4401). In another embodiment, an array of oligo-
nucleotide probes comprising GPR25 nucleotide sequence
or fragments thereof can be constructed to conduct efficient
screening of, e.g., genetic mutations. Array technology meth-
ods are well known and have general applicability and can
be used to address a variety of questions in molecular 
genetics including gene expression, genetic linkage, and 
genetic variability (see for example: M. Chee et al., Science, 
Vol 274, pp 610-613 (1996)). 
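
The size comparison described above for deletions and insertions can be sketched in a few lines of Python. This is only an illustrative sketch: the function name classify_amplicon, the tolerance parameter and the example fragment sizes are assumptions introduced here and do not come from the specification.

```python
def classify_amplicon(observed_bp: int, reference_bp: int, tolerance_bp: int = 0) -> str:
    """Compare the size of an amplified product with the normal genotype.

    observed_bp   -- measured length of the amplified product (hypothetical input)
    reference_bp  -- expected length for the normal genotype (hypothetical input)
    tolerance_bp  -- assumed measurement tolerance; sizes within it count as unchanged
    """
    difference = observed_bp - reference_bp
    if abs(difference) <= tolerance_bp:
        return "no size change"
    # A longer product suggests an insertion, a shorter one a deletion.
    return "insertion" if difference > 0 else "deletion"


if __name__ == "__main__":
    # Hypothetical fragment sizes, in base pairs, against a 300 bp reference.
    for size in (300, 312, 285):
        print(size, classify_amplicon(size, reference_bp=300))
```

A size change only indicates that a deletion or insertion may be present; point mutations leave the product length unchanged and need the hybridization or sequencing approaches named in the text.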

The diagnostic assays offer a process for diagnosing or 
determining a susceptibility to the Diseases through detec- 
tion of mutation in the GPR25 gene by the methods 
described. In addition, such diseases may be diagnosed by 
methods comprising determining from a sample derived 
from a subject an abnormally decreased or increased level of 
polypeptide or mRNA. Decreased or increased expression 
can be measured at the RNA level using any of the methods 
well known in the art for the quantitation of polynucleotides, 
such as, for example, nucleic acid amplification, for instance 
PCR, RT-PCR, RNase protection, Northern blotting and 
other hybridization methods. Assay techniques that can be 
used to determine levels of a protein, such as a polypeptide 
of the present invention, in a sample derived from a host are 
well-known to those of skill in the art. Such assay methods 
include radioimmunoassays, competitive-binding assays, 
Western Blot analysis and ELISA assays. 

Thus in another aspect, the present invention relates to a 
diagnostic kit which comprises:

(a) a polynucleotide of the present invention, preferably 
the nucleotide sequence of SEQ ID NO:1, or a fragment
thereof; 

(b) a nucleotide sequence complementary to that of (a); 

(c) a polypeptide of the present invention, preferably the 
polypeptide of SEQ ID NO:2 or a fragment thereof; or 

(d) an antibody to a polypeptide of the present invention, 
preferably to the polypeptide of SEQ ID NO:2. 

It will be appreciated that in any such kit, (a), (b), (c) or 
(d) may comprise a substantial component. Such a kit will 
be of use in diagnosing a disease or susceptibility to a
disease, particularly infections such as bacterial, fungal, 
protozoan and viral infections, particularly infections caused 
by HIV-1 or HIV-2; pain; cancers; diabetes, obesity; anor- 
exia; bulimia; asthma; Parkinson's disease; acute heart fail- 
ure; hypotension; hypertension; urinary retention; 
osteoporosis; angina pectoris; myocardial infarction; stroke; 
ulcers; asthma; allergies; benign prostatic hypertrophy; 
migraine; vomiting; psychotic and neurological disorders, 
including anxiety, schizophrenia, manic depression, 
depression, delirium, dementia, and severe mental retarda-
tion; and dyskinesias, such as Huntington's disease or Gilles
de la Tourette's syndrome, amongst others.

The nucleotide sequences of the present invention are also 
valuable for chromosome identification. The sequence is 
specifically targeted to, and can hybridize with, a particular
location on an individual human chromosome. The mapping 
of relevant sequences to chromosomes according to the 
present invention is an important first step in correlating 
those sequences with gene-associated disease. Once a
sequence has been mapped to a precise chromosomal 
location, the physical position of the sequence on the chro- 
mosome can be correlated with genetic map data. Such data 
are found in, for example, V. McKusick, Mendelian Inher-
itance in Man (available on-line through Johns Hopkins
University Welch Medical Library). The relationship
between genes and diseases that have been mapped to the
same chromosomal region is then identified through link-
age analysis (coinheritance of physically adjacent genes).

The differences in the cDNA or genomic sequence 
between affected and unaffected individuals can also be
determined. If a mutation is observed in some or all of the 
affected individuals but not in any normal individuals, then 
the mutation is likely to be the causative agent of the disease. 
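
The comparison of affected and unaffected individuals described above amounts to a set difference, which the following Python sketch illustrates. The function name and the variant labels are hypothetical and are used only for illustration; the sketch also relaxes nothing in the text, since a variant that passes this filter is only a candidate.

```python
def candidate_causative_variants(affected, unaffected):
    """Return variants observed in at least one affected individual
    and in none of the unaffected (normal) individuals.

    Both arguments are lists of sets, one set of variant labels per individual.
    """
    seen_in_affected = set().union(*affected)
    seen_in_unaffected = set().union(*unaffected)
    return seen_in_affected - seen_in_unaffected


if __name__ == "__main__":
    # Hypothetical variant labels for three affected and two unaffected individuals.
    affected = [{"var_A", "var_B"}, {"var_A"}, {"var_A", "var_C"}]
    unaffected = [{"var_B"}, {"var_C"}]
    print(candidate_causative_variants(affected, unaffected))
```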

The polypeptides of the invention or their fragments or 
analogs thereof, or cells expressing them, can also be used
as immunogens to produce antibodies immunospecific for 
polypeptides of the present invention. The term "immuno- 
specific" means that the antibodies have substantially greater 
affinity for the polypeptides of the invention than their 
affinity for other related polypeptides in the prior art.

Antibodies generated against polypeptides of the present 
invention may be obtained by administering the polypep- 
tides or epitope-bearing fragments, analogs or cells to an 
animal, preferably a non-human animal, using routine pro-
tocols. For preparation of monoclonal antibodies, any tech-
nique which provides antibodies produced by continuous
cell line cultures can be used. Examples include the hybri-
doma technique (Kohler, G. and Milstein, C., Nature (1975)
256:495-497), the trioma technique, the human B-cell
hybridoma technique (Kozbor et al., Immunology Today
(1983) 4:72) and the EBV-hybridoma technique (Cole et al., 
MONOCLONAL ANTIBODIES AND CANCER 
THERAPY, pp. 77-96, Alan R. Liss, Inc., 1985). 

Techniques for the production of single chain antibodies, 
such as those described in U.S. Pat. No. 4,946,778, can also
be adapted to produce single chain antibodies to polypep- 
tides of this invention. Also, transgenic mice, or other 
organisms, including other mammals, may be used to 
express humanized antibodies. 

The above-described antibodies may be employed to
isolate or to identify clones expressing the polypeptide or to 
purify the polypeptides by affinity chromatography. 

Antibodies against polypeptides of the present invention 
may also be employed to treat the Diseases, amongst others. 

In a further aspect, the present invention relates to geneti-
cally engineered soluble fusion proteins comprising a
polypeptide of the present invention, or a fragment thereof, 
and various portions of the constant regions of heavy or light 
chains of immunoglobulins of various subclasses (IgG, IgM, 
IgA, IgE). Preferred as an immunoglobulin is the constant
part of the heavy chain of human IgG, particularly IgG1,
where fusion takes place at the hinge region. In a particular 
embodiment, the Fc part can be removed simply by incor- 
poration of a cleavage sequence which can be cleaved with 
blood clotting factor Xa. Furthermore, this invention relates
to processes for the preparation of these fusion proteins by 
genetic engineering, and to the use thereof for drug 
screening, diagnosis and therapy. A further aspect of the 
invention also relates to polynucleotides encoding such 
fusion proteins. Examples of fusion protein technology can
be found in International Patent Application Nos. WO94/
29458 and WO94/22914.
Another aspect of the invention relates to a method for
inducing an immunological response in a mammal which 
comprises inoculating the mammal with a polypeptide of the 
present invention, adequate to produce antibody and/or T 
cell immune response to protect said animal from the 
Diseases hereinbefore mentioned, amongst others. Yet 
another aspect of the invention relates to a method of 
inducing an immunological response in a mammal which
comprises delivering a polypeptide of the present invention
via a vector directing expression of the polynucleotide and 
coding for the polypeptide in vivo in order to induce such an 
immunological response to produce antibody to protect said 
animal from diseases. 

A further aspect of the invention relates to an 
immunological/vaccine formulation (composition) which, 
when introduced into a mammalian host, induces an immu- 
nological response in that mammal to a polypeptide of the 
present invention wherein the composition comprises a 
polypeptide or polynucleotide of the present invention. The 
vaccine formulation may further comprise a suitable carrier. 
Since a polypeptide may be broken down in the stomach, it 
is preferably administered parenterally (for instance, 
subcutaneous, intramuscular, intravenous, or intradermal 
injection). Formulations suitable for parenteral administra- 
tion include aqueous and non-aqueous sterile injection solu- 
tions which may contain anti-oxidants, buffers, bacteriostats 
and solutes which render the formulation isotonic with the
blood of the recipient; and aqueous and non-aqueous sterile 
suspensions which may include suspending agents or thick- 
ening agents. The formulations may be presented in unit- 
dose or multi-dose containers, for example, sealed ampoules 
and vials and may be stored in a freeze-dried condition 
requiring only the addition of the sterile liquid carrier 
immediately prior to use. The vaccine formulation may also 
include adjuvant systems for enhancing the immunogenicity 
of the formulation, such as oil-in-water systems and other
systems known in the art. The dosage will depend on the 
specific activity of the vaccine and can be readily deter- 
mined by routine experimentation. 

Polypeptides of the present invention are responsible for 
many biological functions, including many disease states, in 
particular the Diseases hereinbefore mentioned. It is there-
fore desirable to devise screening methods to identify com-
pounds which stimulate or which inhibit the function of the
polypeptide. Accordingly, in a further aspect, the present 
invention provides for a method of screening compounds to 
identify those which stimulate or which inhibit the function 
of the polypeptide. In general, agonists or antagonists may 
be employed for therapeutic and prophylactic purposes for 
such Diseases as hereinbefore mentioned. Compounds may 
be identified from a variety of sources, for example, cells, 
cell-free preparations, chemical libraries, and natural prod- 
uct mixtures. Such agonists, antagonists or inhibitors 
so-identified may be natural or modified substrates, ligands, 
receptors, enzymes, etc., as the case may be, of the polypep-
tide; or may be structural or functional mimetics thereof (see
Coligan et al., Current Protocols in Immunology
1(2):Chapter 5 (1991)).

The screening method may simply measure the binding of 
a candidate compound to the polypeptide, or to cells or 
membranes bearing the polypeptide, or a fusion protein 
thereof by means of a label directly or indirectly associated 
with the candidate compound. Alternatively, the screening 
method may involve competition with a labeled competitor. 
Further, these screening methods may test whether the 
candidate compound results in a signal generated by acti- 
vation or inhibition of the polypeptide, using detection 
systems appropriate to the cells bearing the polypeptide.
Inhibitors of activation are generally assayed in the presence
of a known agonist, and the effect of the candidate
compound on activation by the agonist is
observed. Constitutively active polypeptides may be
employed in screening methods for inverse agonists or 
inhibitors, in the absence of an agonist or inhibitor, by 
testing whether the candidate compound results in inhibition 
of activation of the polypeptide. Further, the screening 
methods may simply comprise the steps of mixing a candi- 
date compound with a solution containing a polypeptide of 
the present invention, to form a mixture, measuring GPR25 
activity in the mixture, and comparing the GPR25 activity of 
the mixture to a standard. Fusion proteins, such as those 
made from Fc portion and GPR25 polypeptide, as herein- 
before described, can also be used for high-throughput 
screening assays to identify antagonists for the polypeptide 
of the present invention (see D. Bennett et al., J Mol 
Recognition, 8:52-58 (1995); and K. Johanson et al., J Biol 
Chem, 270(16):9459-9471 (1995)).
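
The simplest screening format described above (mix, measure activity, compare with a standard) can be illustrated with a short Python sketch. The function name, the 20% threshold and the example activity values are assumptions made for illustration only; the specification does not state how large a difference from the standard should count as stimulation or inhibition.

```python
def classify_compound(activity_with_compound: float,
                      standard_activity: float,
                      threshold: float = 0.2) -> str:
    """Compare measured activity of a mixture with a standard.

    threshold is an assumed fractional change (0.2 = 20%) beyond which the
    candidate compound is flagged as stimulating or inhibiting.
    """
    if standard_activity <= 0:
        raise ValueError("standard_activity must be positive")
    relative_change = (activity_with_compound - standard_activity) / standard_activity
    if relative_change >= threshold:
        return "stimulates"
    if relative_change <= -threshold:
        return "inhibits"
    return "no clear effect"


if __name__ == "__main__":
    # Hypothetical activity readings in arbitrary units against a standard of 100.
    for reading in (150.0, 50.0, 105.0):
        print(reading, classify_compound(reading, standard_activity=100.0))
```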

The polynucleotides, polypeptides and antibodies to the 
polypeptide of the present invention may also be used to 
configure screening methods for detecting the effect of 
added compounds on the production of mRNA and polypep-
tide in cells. For example, an ELISA assay may be con-
structed for measuring secreted or cell associated levels of
polypeptide using monoclonal and polyclonal antibodies by 
standard methods known in the art. This can be used to 
discover agents which may inhibit or enhance the production 
of polypeptide (also called antagonist or agonist,
respectively) from suitably manipulated cells or tissues. 

The polypeptide may be used to identify membrane bound 
or soluble receptors, if any, through standard receptor bind- 
ing techniques known in the art. These include, but are not 
limited to, ligand binding and crosslinking assays in which
the polypeptide is labeled with a radioactive isotope (for 
instance, 125I), chemically modified (for instance,
biotinylated), or fused to a peptide sequence suitable for 
detection or purification, and incubated with a source of the 
putative receptor (cells, cell membranes, cell supernatants,
tissue extracts, bodily fluids). Other methods include bio- 
physical techniques such as surface plasmon resonance and 
spectroscopy. These screening methods may also be used to 
identify agonists and antagonists of the polypeptide which 
compete with the binding of the polypeptide to its receptors,
if any. Standard methods for conducting such assays are well 
understood in the art. 

Examples of potential polypeptide antagonists include 
antibodies or, in some cases, oligonucleotides or proteins 
which are closely related to the ligands, substrates,
receptors, enzymes, etc., as the case may be, of the 
polypeptide, e.g., a fragment of the ligands, substrates, 
receptors, enzymes, etc.; or small molecules which bind to 
the polypeptide of the present invention but do not elicit a 
response, so that the activity of the polypeptide is prevented.

Thus, in another aspect, the present invention relates to a 
screening kit for identifying agonists, antagonists, ligands, 
receptors, substrates, enzymes, etc. for polypeptides of the 
present invention; or compounds which decrease or enhance 
the production of such polypeptides, which comprises: 

(a) a polypeptide of the present invention; 

(b) a recombinant cell expressing a polypeptide of the 
present invention; 

(c) a cell membrane expressing a polypeptide of the 
present invention; or

(d) antibody to a polypeptide of the present invention; 
which polypeptide is preferably that of SEQ ID NO:2. 
It will be appreciated that in any such kit, (a), (b), (c) or
(d) may comprise a substantial component. 

It will be readily appreciated by the skilled artisan that a 
polypeptide of the present invention may also be used in a 
method for the structure-based design of an agonist, antago- 
nist or inhibitor of the polypeptide, by: 

(a) determining in the first instance the three-dimensional 
structure of the polypeptide; 

(b) deducing the three-dimensional structure for the likely 
reactive or binding site(s) of an agonist, antagonist or 
inhibitor; 

(c) synthesizing candidate compounds that are predicted to
bind to or react with the deduced binding or reactive 
site; and 

(d) testing whether the candidate compounds are indeed 
agonists, antagonists or inhibitors. 

It will be further appreciated that this will normally be an 
iterative process.

In a further aspect, the present invention provides meth- 
ods of treating abnormal conditions such as, for instance, 
infections such as bacterial, fungal, protozoan and viral 
infections, particularly infections caused by HIV-1 or HIV- 
2; pain; cancers; diabetes, obesity; anorexia; bulimia; 
asthma; Parkinson's disease; acute heart failure; hypoten- 
sion; hypertension; urinary retention; osteoporosis; angina 
pectoris; myocardial infarction; stroke; ulcers; asthma; aller- 
gies; benign prostatic hypertrophy; migraine; vomiting; psy- 
chotic and neurological disorders, including anxiety, 
schizophrenia, manic depression, depression, delirium, 
dementia, and severe mental retardation; and dyskinesias, 
such as Huntington's disease or Gilles de la Tourette's
syndrome, related to either an excess of, or an under- 
expression of, GPR25 polypeptide activity. 

If the activity of the polypeptide is in excess, several 
approaches are available. One approach comprises admin- 
istering to a subject in need thereof an inhibitor compound 
(antagonist) as hereinabove described, optionally in combi- 
nation with a pharmaceutically acceptable carrier, in an 
amount effective to inhibit the function of the polypeptide, 
such as, for example, by blocking the binding of ligands, 
substrates, receptors, enzymes, etc., or by inhibiting a sec- 
ond signal, and thereby alleviating the abnormal condition. 
In another approach, soluble forms of the polypeptides still 
capable of binding the ligand, substrate, enzymes, receptors, 
etc. in competition with endogenous polypeptide may be 
administered. Typical examples of such competitors include 
fragments of the GPR25 polypeptide. 

In still another approach, expression of the gene encoding 
endogenous GPR25 polypeptide can be inhibited using 
expression blocking techniques. Such known techniques
involve the use of antisense sequences, either internally 
generated or separately administered (see, for example, 
O'Connor, J Neurochem (1991) 56:560 in Oligodeoxynucle-
otides as Antisense Inhibitors of Gene Expression, CRC
Press, Boca Raton, Fla. (1988)). Alternatively, oligonucle- 
otides which form triple helices with the gene can be 
supplied (see, for example, Lee et al., Nucleic Acids Res 
(1979) 6:3073; Cooney et al., Science (1988) 241:456; Der-
van et al., Science (1991) 251:1360). These oligomers can be
administered per se or the relevant oligomers can be 
expressed in vivo. 

For treating abnormal conditions related to an under- 
expression of GPR25 and its activity, several approaches are 
also available. One approach comprises administering to a 
subject a therapeutically effective amount of a compound 
which activates a polypeptide of the present invention, i.e., 
an agonist as described above, in combination with a phar-
maceutically acceptable carrier, to thereby alleviate the
abnormal condition. Alternatively, gene therapy may be 
employed to effect the endogenous production of GPR25 by 
the relevant cells in the subject. For example, a polynucle-
otide of the invention may be engineered for expression in
a replication defective retroviral vector, as discussed above. 
The retroviral expression construct may then be isolated and 
introduced into a packaging cell transduced with a retroviral 
plasmid vector containing RNA encoding a polypeptide of 
the present invention such that the packaging cell now
produces infectious viral particles containing the gene of 
interest. These producer cells may be administered to a 
subject for engineering cells in vivo and expression of the 
polypeptide in vivo. For an overview of gene therapy, see 
Chapter 20, Gene Therapy and other Molecular Genetic-
based Therapeutic Approaches, (and references cited
therein) in Human Molecular Genetics, T Strachan and A P 
Read, BIOS Scientific Publishers Ltd (1996). Another 
approach is to administer a therapeutic amount of a polypep-
tide of the present invention in combination with a suitable
pharmaceutical carrier. 

In a further aspect, the present invention provides for 
pharmaceutical compositions comprising a therapeutically 
effective amount of a polypeptide, such as the soluble form 
of a polypeptide of the present invention, agonist/antagonist
peptide or small molecule compound, in combination with a 
pharmaceutically acceptable carrier or excipient. Such car-
riers include, but are not limited to, saline, buffered saline,
dextrose, water, glycerol, ethanol, and combinations thereof. 
The invention further relates to pharmaceutical packs and
kits comprising one or more containers filled with one or 
more of the ingredients of the aforementioned compositions 
of the invention. Polypeptides and other compounds of the 
present invention may be employed alone or in conjunction 
with other compounds, such as therapeutic compounds.

The composition will be adapted to the route of 
administration, for instance by a systemic or an oral route. 
Preferred forms of systemic administration include 
injection, typically by intravenous injection. Other injection 
routes, such as subcutaneous, intramuscular, or
intraperitoneal, can be used. Alternative means for systemic 
administration include transmucosal and transdermal admin- 
istration using penetrants such as bile salts or fusidic acids 
or other detergents. In addition, if a polypeptide or other 
compounds of the present invention can be formulated in an
enteric or an encapsulated formulation, oral administration 
may also be possible. Administration of these compounds 
may also be topical and/or localized, in the form of salves, 
pastes, gels, and the like. 

The dosage range required depends on the choice of
peptide or other compounds of the present invention, the 
route of administration, the nature of the formulation, the 
nature of the subject's condition, and the judgment of the 
attending practitioner. Suitable dosages, however, are in the
range of 0.1-100 μg/kg of subject. Wide variations in the
needed dosage, however, are to be expected in view of the 
variety of compounds available and the differing efficiencies 
of various routes of administration. For example, oral 
administration would be expected to require higher dosages 
than administration by intravenous injection. Variations in
these dosage levels can be adjusted using standard empirical 
routines for optimization, as is well understood in the art. 
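
As a worked illustration of the arithmetic behind the stated range, the Python sketch below converts the 0.1-100 μg/kg range into total amounts for a given body weight. The function name and the 70 kg example weight are assumptions introduced here; the sketch shows only the multiplication and does not account for the route, formulation or clinical judgment that the text says also determine dosage.

```python
def dose_range_micrograms(body_weight_kg: float,
                          low_ug_per_kg: float = 0.1,
                          high_ug_per_kg: float = 100.0) -> tuple:
    """Scale a per-kilogram dosage range (μg/kg) to a total amount (μg).

    The default bounds are the range quoted in the text; body_weight_kg is a
    hypothetical input chosen by the caller.
    """
    if body_weight_kg <= 0:
        raise ValueError("body_weight_kg must be positive")
    return (low_ug_per_kg * body_weight_kg, high_ug_per_kg * body_weight_kg)


if __name__ == "__main__":
    low, high = dose_range_micrograms(70.0)  # assumed 70 kg subject
    print(f"{low:.1f} to {high:.1f} micrograms")
```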

Polypeptides used in treatment can also be generated 
endogenously in the subject, in treatment modalities often 
referred to as "gene therapy" as described above. Thus, for
example, cells from a subject may be engineered with a 
polynucleotide, such as a DNA or RNA, to encode a
polypeptide ex vivo, for example, by the use of a
retroviral plasmid vector. The cells are then introduced into 
the subject. 

Polynucleotide and polypeptide sequences form a valu- 
able information resource with which to identify further 
sequences of similar homology. This is most easily facili- 
tated by storing the sequence in a computer readable 
medium and then using the stored data to search a sequence 
database using well known searching tools, such as GCG.
Accordingly, in a further aspect, the present invention pro- 
vides for a computer readable medium having stored thereon 
a polynucleotide comprising the sequence of SEQ ID NO:1
and/or a polypeptide sequence encoded thereby. 
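
The idea of storing a sequence in computer readable form and using it to search a database can be sketched as follows. This is a deliberately simplified stand-in: it performs an exact substring search, whereas the searching tools referred to in the text rank sequences by similarity. The FASTA-style records, their names and the query are hypothetical and are not the sequences of the invention.

```python
import io


def parse_fasta(handle):
    """Read FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {record: "".join(parts) for record, parts in records.items()}


def find_exact_matches(query, database):
    """Return names of database sequences that contain the query exactly."""
    query = query.upper()
    return [name for name, sequence in database.items() if query in sequence]


# Hypothetical stored sequences standing in for a sequence database.
EXAMPLE_FASTA = """>seq_a hypothetical record
ATGGCCTTAGGC
TTAACCGG
>seq_b hypothetical record
GGGTTTAAACCC
"""

if __name__ == "__main__":
    database = parse_fasta(io.StringIO(EXAMPLE_FASTA))
    print(find_exact_matches("taggcttaa", database))
```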

The following definitions are provided to facilitate under- 
standing of certain terms used frequently hereinbefore. 

"Antibodies" as used herein includes polyclonal and 
monoclonal antibodies, chimeric, single chain, and human- 
ized antibodies, as well as Fab fragments, including the 
products of an Fab or other immunoglobulin expression 
library. 

"Isolated" means altered "by the hand of man" from the 
natural state. If an "isolated" composition or substance 
occurs in nature, it has been changed or removed from its 
original environment, or both. For example, a polynucle- 
otide or a polypeptide naturally present in a living animal is 
not "isolated," but the same polynucleotide or polypeptide 
separated from the coexisting materials of its natural state is 
"isolated", as the term is employed herein. 

"Polynucleotide" generally refers to any polyribonucle- 
otide or polydeoxyribonucleotide, which may be unmodified
RNA or DNA or modified RNA or DNA. "Polynucleotides" 
include, without limitation, single- and double-stranded 
DNA, DNA that is a mixture of single- and double-stranded 
regions, single- and double-stranded RNA, and RNA that is 
a mixture of single- and double-stranded regions, hybrid mol-
ecules comprising DNA and RNA that may be single-
stranded or, more typically, double-stranded or a mixture of
single- and double-stranded regions. In addition, "poly- 
nucleotide" refers to triple-stranded regions comprising 
RNA or DNA or both RNA and DNA. The term "polynucle- 
otide" also includes DNAs or RNAs containing one or more 
modified bases and DNAs or RNAs with backbones modi- 
fied for stability or for other reasons. "Modified" bases 
include, for example, tritylated bases and unusual bases such 
as inosine. A variety of modifications may be made to DNA 
and RNA; thus, "polynucleotide" embraces chemically, 
enzymatically or metabolically modified forms of poly- 
nucleotides as typically found in nature, as well as the 
chemical forms of DNA and RNA characteristic of viruses 
and cells. "Polynucleotide" also embraces relatively short 
polynucleotides, often referred to as oligonucleotides. 

"Polypeptide" refers to any peptide or protein comprising 
two or more amino acids joined to each other by peptide 
bonds or modified peptide bonds, i.e., peptide isosteres. 
"Polypeptide" refers to both short chains, commonly 
referred to as peptides, oligopeptides or oligomers, and to 
longer chains, generally referred to as proteins. Polypeptides 
may contain amino acids other than the 20 gene-encoded
amino acids. "Polypeptides" include amino acid sequences 
modified either by natural processes, such as post- 
translational processing, or by chemical modification tech- 
niques which are well known in the art. Such modifications 
are well described in basic texts and in more detailed 
monographs, as well as in a voluminous research literature. 
Modifications may occur anywhere in a polypeptide, includ- 
ing the peptide backbone, the amino acid side-chains and the 
amino or carboxyl termini. It will be appreciated that the 
same type of modification may be present to the same or
varying degrees at several sites in a given polypeptide. Also, 
a given polypeptide may contain many types of modifica- 
tions. Polypeptides may be branched as a result of 
ubiquitination, and they may be cyclic, with or without 
branching. Cyclic, branched and branched cyclic polypep- 
tides may result from post-translation natural processes or 
may be made by synthetic methods. Modifications include 
acetylation, acylation, ADP-ribosylation, amidation, cova- 
lent attachment of flavin, covalent attachment of a heme 
moiety, covalent attachment of a nucleotide or nucleotide 
derivative, covalent attachment of a lipid or lipid derivative, 
covalent attachment of phosphatidylinositol, cross-linking,
cyclization, disulfide bond formation, demethylation, for- 
mation of covalent cross-links, formation of cystine, forma- 
tion of pyroglutamate, formylation, gamma-carboxylation, 
glycosylation, GPI anchor formation, hydroxylation, 
iodination, methylation, myristoylation, oxidation, pro- 
teolytic processing, phosphorylation, prenylation, 
racemization, selenoylation, sulfation, transfer-RNA medi- 
ated addition of amino acids to proteins such as arginylation, 
and ubiquitination (see, for instance, PROTEINS — 
STRUCTURE AND MOLECULAR PROPERTIES, 2nd 
Ed., T. E. Creighton, W. H. Freeman and Company, New 
York, 1993; Wold, F., Post-translational Protein Modifica-
tions: Perspectives and Prospects, pgs. 1-12 in POST-
TRANSLATIONAL COVALENT MODIFICATION OF
PROTEINS, B. C. Johnson, Ed., Academic Press, New 
York, 1983; Seifter et al., "Analysis for protein modifica-
tions and nonprotein cofactors", Meth Enzymol (1990)
182:626-646 and Rattan et al., "Protein Synthesis: Post- 
translational Modifications and Aging", Ann NY Acad Sci 
(1992) 663:48-62). 

"Variant" refers to a polynucleotide or polypeptide that 
differs from a reference polynucleotide or polypeptide, but 
retains essential properties. A typical variant of a polynucle- 
otide differs in nucleotide sequence from another, reference 
polynucleotide. Changes in the nucleotide sequence of the 
variant may or may not alter the amino acid sequence of a 
polypeptide encoded by the reference polynucleotide. 
Nucleotide changes may result in amino acid substitutions, 
additions, deletions, fusions and truncations in the polypep- 
tide encoded by the reference sequence, as discussed below. 
A typical variant of a polypeptide differs in amino acid 
sequence from another, reference polypeptide. Generally, 
differences are limited so that the sequences of the reference 
polypeptide and the variant are closely similar overall and, 
in many regions, identical. A variant and reference polypeptide
may differ in amino acid sequence by one or more
substitutions, additions, or deletions in any combination. A
substituted or inserted amino acid residue may or may not be
one encoded by the genetic code. A variant of a polynucleotide
or polypeptide may be naturally occurring, such as an
allelic variant, or it may be a variant that is not known to
occur naturally. Non-naturally occurring variants of polynucleotides
and polypeptides may be made by mutagenesis
techniques or by direct synthesis.

"Identity" is a measure of the identity of nucleotide 
sequences or amino acid sequences. In general, the 
sequences are aligned so that the highest order match is 
obtained. "Identity" per se has an art-recognized meaning 
and can be calculated using published techniques (see, e.g.: 
COMPUTATIONAL MOLECULAR BIOLOGY, Lesk, A. 
M., ed., Oxford University Press, New York, 1988; BIOCOMPUTING:
INFORMATICS AND GENOME PROJECTS, Smith, D. W., ed.,
Academic Press, New York, 1993; COMPUTER ANALYSIS OF
SEQUENCE DATA, PART I, Griffin, A. M., and Griffin, H. G.,
eds., Humana Press, New Jersey, 1994; SEQUENCE ANALYSIS IN
MOLECULAR BIOLOGY, von Heijne, G., Academic Press, 1987;
and SEQUENCE ANALYSIS PRIMER, Gribskov, M. and Devereux,
J., eds., M Stockton Press, New York, 1991). While there exist
a number of methods to measure identity between two
polynucleotide or polypeptide sequences, the term "identity"
is well known to skilled artisans (Carillo, H., and Lipton, D.,
SIAM J Applied Math (1988) 48:1073). Methods commonly
employed to determine identity or similarity between two
sequences include, but are not limited to, those disclosed in
Guide to Huge Computers, Martin J. Bishop, ed., Academic
Press, San Diego, 1994, and Carillo, H., and Lipton, D., SIAM J
Applied Math (1988) 48:1073. Methods to determine identity
and similarity are codified in computer programs. Preferred
computer program methods to determine identity and
similarity between two sequences include, but are not limited
to, the GCG program package (Devereux, J., et al., Nucleic
Acids Research (1984) 12(1):387), BLASTP, BLASTN, and
FASTA (Altschul, S. F. et al., J Molec Biol (1990) 215:403).
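As an illustration only, percent identity over an existing pairwise alignment can be sketched as follows. The specification does not fix a formula, so this sketch assumes one common convention (identical columns divided by all alignment columns in which at least one sequence has a residue); the function name and the "-" gap character are hypothetical choices, and the programs cited above may count gaps differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption of this sketch: a column counts toward the denominator
    whenever at least one sequence has a residue in it, and counts as a
    match only when both residues are present and identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column empty in both sequences: ignore it
        columns += 1
        if a == b:  # both non-gap here, because gap/gap was skipped
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no residue columns")
    return 100.0 * matches / columns


# Hypothetical toy alignment: 4 identical columns out of 6.
example = percent_identity("ACGT-A", "ACCTGA")
```

The alignment step itself (finding the "highest order match") is left to the cited programs; the sketch only scores an alignment that has already been produced.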

By way of example, a polynucleotide sequence of the
present invention may be identical to the reference sequence
of SEQ ID NO:1, that is, be 100% identical, or it may include
up to a certain integer number of nucleotide alterations as
compared to the reference sequence. Such alterations are
selected from the group consisting of at least one nucleotide
deletion, substitution, including transition and transversion,
or insertion, and wherein said alterations may occur at the 5'
or 3' terminal positions of the reference nucleotide sequence
or anywhere between those terminal positions, interspersed
either individually among the nucleotides in the reference
sequence or in one or more contiguous groups within the
reference sequence. The number of nucleotide alterations is
determined by multiplying the total number of nucleotides in
SEQ ID NO:1 by the numerical percent of the respective
percent identity (divided by 100) and subtracting that product
from said total number of nucleotides in SEQ ID NO:1,
or:

n_n = x_n - (x_n · y),

wherein n_n is the number of nucleotide alterations, x_n is the
total number of nucleotides in SEQ ID NO:1, and y is 0.50
for 50%, 0.60 for 60%, 0.70 for 70%, 0.80 for 80%, 0.85 for
85%, 0.90 for 90%, 0.95 for 95%, 0.97 for 97% or 1.00 for
100%, and wherein any non-integer product of x_n and y is
rounded down to the nearest integer prior to subtracting it
from x_n. Alterations of a polynucleotide sequence encoding
the polypeptide of SEQ ID NO:2 may create nonsense,
missense or frameshift mutations in this coding sequence
and thereby alter the polypeptide encoded by the polynucleotide
following such alterations.
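The counting rule described above can be sketched in a few lines. This is a minimal illustration, not part of the specification: the function name is hypothetical, and the percent identity is passed as a whole-number percentage so that the "round down" step can be done in exact integer arithmetic. The lengths used are those given in the sequence listing (1174 nucleotides for SEQ ID NO:1, 361 amino acids for SEQ ID NO:2).

```python
def max_alterations(total_length: int, percent_identity: int) -> int:
    """Number of alterations allowed at a given percent identity.

    Follows the rule in the text: multiply the total length by the
    percent identity (divided by 100), round any non-integer product
    down, and subtract the result from the total length.
    """
    if total_length < 0:
        raise ValueError("total_length must be non-negative")
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent_identity must be between 0 and 100")
    # Integer floor division performs the "rounded down" step exactly.
    retained = (total_length * percent_identity) // 100
    return total_length - retained


# Lengths taken from the sequence listing.
nucleotide_alterations_95 = max_alterations(1174, 95)  # 1174 - 1115
amino_acid_alterations_95 = max_alterations(361, 95)   # 361 - 342
```

Using an integer percentage avoids floating-point edge cases when the product happens to be a whole number; a fractional percentage would need a decimal type instead.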

Similarly, a polypeptide sequence of the present invention
may be identical to the reference sequence of SEQ ID NO:2,
that is, be 100% identical, or it may include up to a certain
integer number of amino acid alterations as compared to the
reference sequence such that the % identity is less than
100%. Such alterations are selected from the group consisting
of at least one amino acid deletion, substitution, including
conservative and non-conservative substitution, or
insertion, and wherein said alterations may occur at the
amino- or carboxy-terminal positions of the reference
polypeptide sequence or anywhere between those terminal
positions, interspersed either individually among the amino
acids in the reference sequence or in one or more contiguous
groups within the reference sequence. The number of amino
acid alterations for a given % identity is determined by
multiplying the total number of amino acids in SEQ ID
NO:2 by the numerical percent of the respective percent
identity (divided by 100) and then subtracting that product
from said total number of amino acids in SEQ ID NO:2, or:

n_a = x_a - (x_a · y),

wherein n_a is the number of amino acid alterations, x_a is the
total number of amino acids in SEQ ID NO:2, and y is, for
instance, 0.70 for 70%, 0.80 for 80%, 0.85 for 85% etc., and
wherein any non-integer product of x_a and y is rounded
down to the nearest integer prior to subtracting it from x_a.

"Fusion protein" refers to a protein encoded by two, often
unrelated, fused genes or fragments thereof. In one example,
EP-A-0 464 discloses fusion proteins comprising various
portions of constant region of immunoglobulin molecules
together with another human protein or part thereof. In many
cases, employing an immunoglobulin Fc region as a part of
a fusion protein is advantageous for use in therapy and
[...] pharmacokinetic properties [see, e.g., EP-A 0232 262].
On the other hand, for some uses it would be desirable to be
able to delete the Fc part after the fusion protein has been
expressed, detected and purified.

All publications, including but not limited to patents and
patent applications, cited in this specification are herein
incorporated by reference as if each individual publication
were specifically and individually indicated to be incorporated
by reference herein as though fully set forth.

EXAMPLES

Example 1

Mammalian Cell Expression

The receptors of the present invention are expressed in
either human embryonic kidney 293 (HEK293) cells or
adherent dhfr CHO cells. To maximize receptor expression,
typically all 5' and 3' untranslated regions (UTRs) are
removed from the receptor cDNA prior to insertion into a
pCDN or pCDNA3 vector. The cells are transfected with
individual receptor cDNAs by lipofectin and selected in the
presence of 400 mg/ml G418. After 3 weeks of selection,
individual clones are picked and expanded for further analysis.
HEK293 or CHO cells transfected with the vector alone
serve as negative controls. To isolate cell lines stably
expressing the individual receptors, about 24 clones are
typically selected and analyzed by Northern blot analysis.
Receptor mRNAs are generally detectable in about 50% of
the G418-resistant clones analyzed.

Example 2

Ligand bank for binding and functional assays

A bank of over 200 putative receptor ligands has been
assembled for screening. The bank comprises: transmitters,
hormones and chemokines known to act via a human seven
transmembrane (7TM) receptor; naturally occurring compounds
which may be putative agonists for a human 7TM
receptor; non-mammalian, biologically active peptides for
which a mammalian counterpart has not yet been identified;
and compounds not found in nature, but which activate 7TM
receptors with unknown natural ligands. This bank is used to
initially screen the receptor for known ligands, using both
functional (i.e., calcium, cAMP, microphysiometer, oocyte
electrophysiology, etc., see below) as well as binding assays.

Example 3

Ligand Binding Assays

Ligand binding assays provide a direct method for ascertaining
receptor pharmacology and are adaptable to a high
throughput format. The purified ligand for a receptor is
radiolabeled to high specific activity (50-2000 Ci/mmol) for
binding studies. A determination is then made that the
process of radiolabeling does not diminish the activity of the
ligand towards its receptor. Assay conditions for buffers,
ions, pH and other modulators such as nucleotides are
optimized to establish a workable signal to noise ratio for
both membrane and whole cell receptor sources. For these
assays, specific receptor binding is defined as total associated
radioactivity minus the radioactivity measured in the
presence of an excess of unlabeled competing ligand. Where
possible, more than one competing ligand is used to define
residual nonspecific binding.

Example 4

Functional Assay in Xenopus Oocytes

Capped RNA transcripts from linearized plasmid templates
encoding the receptor cDNAs of the invention are
synthesized in vitro with RNA polymerases in accordance
with standard procedures. In vitro transcripts are suspended
in water at a final concentration of 0.2 mg/ml. Ovarian lobes
are removed from adult female toads, Stage V defolliculated
oocytes are obtained, and RNA transcripts (10 ng/oocyte)
are injected in a 50 nl bolus using a microinjection apparatus.
Two electrode voltage clamps are used to measure the
currents from individual Xenopus oocytes in response to
agonist exposure. Recordings are made in Ca2+ free Barth's
medium at room temperature. The Xenopus system can be
used to screen known ligands and tissue/cell extracts for
activating ligands.

Example 5

Microphysiometric Assays

Activation of a wide variety of secondary messenger
systems results in extrusion of small amounts of acid from
a cell. The acid formed is largely as a result of the increased
metabolic activity required to fuel the intracellular signaling
process. The pH changes in the media surrounding the cell
are very small but are detectable by the CYTOSENSOR
microphysiometer (Molecular Devices Ltd., Menlo Park,
Calif.). The CYTOSENSOR is thus capable of detecting the
activation of a receptor which is coupled to an energy
utilizing intracellular signaling pathway such as the
G-protein coupled receptor of the present invention.

Example 6

Extract/Cell Supernatant Screening

A large number of mammalian receptors exist for which
there remains, as yet, no cognate activating ligand (agonist).
Thus, active ligands for these receptors may not be included
within the ligands banks as identified to date. Accordingly,
the 7TM receptor of the invention is also functionally
screened (using calcium, cAMP, microphysiometer, oocyte
electrophysiology, etc., functional screens) against tissue
extracts to identify natural ligands. Extracts that produce
positive functional responses can be sequentially subfractionated
until an activating ligand is isolated and identified.

Example 8

Calcium and cAMP Functional Assays

7TM receptors which are expressed in HEK 293 cells
have been shown to be coupled functionally to activation of
PLC and calcium mobilization and/or cAMP stimulation or
inhibition. Basal calcium levels in the HEK 293 cells in
receptor-transfected or vector control cells were observed to
be in the normal, 100 nM to 200 nM, range. HEK 293 cells
expressing recombinant receptors are loaded with fura 2 and
in a single day >150 selected ligands or tissue/cell extracts
are evaluated for agonist induced calcium mobilization.
Similarly, HEK 293 cells expressing recombinant receptors
are evaluated for the stimulation or inhibition of cAMP
production using standard cAMP quantitation assays. Agonists
presenting a calcium transient or cAMP fluctuation are
tested in vector control cells to determine if the response is
unique to the transfected cells expressing receptor.
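The control comparison described in this example can be sketched as a simple filter. Everything in the sketch is hypothetical: the function name, the response units, the threshold, and the ligand names are illustrative assumptions, since the specification does not state a numerical criterion for a positive response.

```python
from typing import Dict, List


def receptor_specific_hits(
    receptor_cells: Dict[str, float],
    vector_control: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Ligands that respond in receptor cells but not in vector controls.

    Assumption of this sketch: a response is "positive" when it meets
    a single, user-chosen threshold. Ligands with no control
    measurement are skipped, because specificity cannot be judged.
    """
    hits = []
    for ligand, response in receptor_cells.items():
        if ligand not in vector_control:
            continue  # not yet tested in vector control cells
        if response >= threshold and vector_control[ligand] < threshold:
            hits.append(ligand)
    return sorted(hits)


# Hypothetical peak responses (arbitrary units) for three ligands.
receptor = {"ligand_A": 250.0, "ligand_B": 40.0, "ligand_C": 300.0}
control = {"ligand_A": 20.0, "ligand_B": 10.0, "ligand_C": 280.0}
specific = receptor_specific_hits(receptor, control, threshold=100.0)
```

In this toy data only ligand_A would be carried forward: ligand_B gives no response, and ligand_C responds equally in the vector control cells.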



SEQUENCE LISTING 

(1) GENERAL INFORMATION: 

(iii) NUMBER OF SEQUENCES: 2 

(2) INFORMATION FOR SEQ ID NO:1:

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 1174 base pairs 

(B) TYPE : nucleic acid 

(C) STRANDEDNESS : single 

(D) TOPOLOGY : linear 

(ii) MOLECULE TYPE: cDNA 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:1:

AGAGCAAACC CCCTCCTGCT CAGAGCTGCT GCCGCCTGCG CCCAGGGCTG CACTCCGCGC 60 

AGGCCTCATA GCCAGGCCAT GGCCCCCACA GAGCCCTGGA GCCCCAGCCC GGGGTCAGCG 120 

CCCTGGGACT ACTCGGGGTT GGACGGCCTG GAGGAGCTGG AGCTGTGTCC GGCCGGGGAC 180 

CTGCCCTACG GCTACGTCTA CATCCCCGCG CTCTACCTGG CGGCCTTCGC CGTGGGCCTG 240 

CTGGGCAACG CCTTTGTGGT GTGGCTGCTG GCCGGGCGGC GGGGCCCGCG GCGGCTGGTG 300 

GATACCTTCG TGCTGCACCT GGCGGCAGCT GACCTGGGCT TCGTGCTCAC GCTGCCGCTG 360 

TGGGCCGCGG CGGCGGCGCT AGGCGGCCGC TGGCCGTTCG GCGATGGCCT CTGCAAGCTC 420 

AGCAGCTTCG CGCTGGCGGG CACGCGCTGC GCGGGCGCGC TGCTGCTGGC GGGCATGAGC 480 

GTGGACCGCT ACCTGGCCGT GGTGAAGCTG CTCGAGGCGA GGCCACTGCG CACCCCGCGC 540 

TGCGCGCTGG CCTCGTGCTG CGGCGTCTGG GCCGTGGCGC TGCTGGCCGG CCTGCCCTCC 600 

CTGGTCTACC GGGGGTTGCA GCCCCTGCCT GGGGGCCAGG ACAGCCAGTG CGGCGAGGAG 660 

CCCTCCCACG CCTTCCAGGG CCTCAGCTTG CTGCTGCTGC TGCTGACCTT CGTGCTGCCC 720 

CTGGTCGTCA CCCTCTTCTG CTACTGCCGC ATCTCGCGCC GCCTGCGACG GCCGCCGCAC 780 

GTGGGTCGGG CCCGGAGGAA CTCGCTGCGC ATCATCTTCG CCATCGAGAG CACGTTTGTG 840 

GGCTCCTGGC TGCCCTTCAG CGCCCTGCGG GCCGTCTTCC ACCTGGCGCG TCTGGGGGCG 900 

CTGCCGCTGC CGTGCCCCCT GCTGCTGGCG CTGCGCTGGG GCCTCACCAT TGCCACCTGC 960 

CTGGCCTTCG TCAACAGCTG CGCCAACCCG CTCATCTACC TCCTGCTGGA CCGCTCATTC 1020 

CGAGCCCGGG CGCTGGACGG GGCCTGCGGG CGCACCGGCC GCCTGGCGCG AAGGATCAGC 1080 

TCAGCCTCCT CGCTCTCCAG GGACGACAGT TCCGTGTTCC GTTGCCGGGC CCAGGCCGCG 1140 

AACACTGCCT CGGCCTCCTG GTAGAAGCTT CGGG 1174 
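As a consistency check on the listing, the coding region recited in the claims (nucleotides 79 to 1161 of SEQ ID NO:1) can be compared with the 361-residue length given for SEQ ID NO:2. The sketch below is illustrative only: it assumes the standard genetic code, the helper names are hypothetical, and only the first ten codons (nucleotides 79 to 108, copied from the listing above) are translated rather than the full sequence.

```python
# Standard genetic code with bases ordered T, C, A, G (an assumption of
# this sketch; the specification does not name a translation table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def region_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive nucleotide range."""
    return end - start + 1


def translate(dna: str) -> str:
    """Translate DNA in frame 1 to one-letter amino acids ('*' = stop)."""
    dna = dna.upper().replace(" ", "")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


coding_length = region_length(79, 1161)
# Nucleotides 79-108 of SEQ ID NO:1 as printed in the listing.
first_codons = "ATGGCCCCCACAGAGCCCTGGAGCCCCAGC"
first_residues = translate(first_codons)
```

The range 79 to 1161 spans 1083 nucleotides, which is exactly 361 codons, and the first ten codons read Met-Ala-Pro-Thr-Glu-Pro-Trp-Ser-Pro-Ser, matching the start of SEQ ID NO:2.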

(2) INFORMATION FOR SEQ ID NO: 2: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 361 amino acids 

(B) TYPE : amino acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(ii) MOLECULE TYPE: protein

(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2:

Met Ala Pro Thr Glu Pro Trp Ser Pro Ser Pro Gly Ser Ala Pro Trp
1               5                   10                  15

Asp Tyr Ser Gly Leu Asp Gly Leu Glu Glu Leu Glu Leu Cys Pro Ala
            20                  25                  30

Gly Asp Leu Pro Tyr Gly Tyr Val Tyr Ile Pro Ala Leu Tyr Leu Ala
        35                  40                  45

Ala Phe Ala Val Gly Leu Leu Gly Asn Ala Phe Val Val Trp Leu Leu
    50                  55                  60

Ala Gly Arg Arg Gly Pro Arg Arg Leu Val Asp Thr Phe Val Leu His
65                  70                  75                  80

Leu Ala Ala Ala Asp Leu Gly Phe Val Leu Thr Leu Pro Leu Trp Ala
                85                  90                  95

Ala Ala Ala Ala Leu Gly Gly Arg Trp Pro Phe Gly Asp Gly Leu Cys
            100                 105                 110

Lys Leu Ser Ser Phe Ala Leu Ala Gly Thr Arg Cys Ala Gly Ala Leu
        115                 120                 125

Leu Leu Ala Gly Met Ser Val Asp Arg Tyr Leu Ala Val Val Lys Leu
    130                 135                 140

Leu Glu Ala Arg Pro Leu Arg Thr Pro Arg Cys Ala Leu Ala Ser Cys
145                 150                 155                 160

Cys Gly Val Trp Ala Val Ala Leu Leu Ala Gly Leu Pro Ser Leu Val
                165                 170                 175

Tyr Arg Gly Leu Gln Pro Leu Pro Gly Gly Gln Asp Ser Gln Cys Gly
            180                 185                 190

Glu Glu Pro Ser His Ala Phe Gln Gly Leu Ser Leu Leu Leu Leu Leu
        195                 200                 205

Leu Thr Phe Val Leu Pro Leu Val Val Thr Leu Phe Cys Tyr Cys Arg
    210                 215                 220

Ile Ser Arg Arg Leu Arg Arg Pro Pro His Val Gly Arg Ala Arg Arg
225                 230                 235                 240

Asn Ser Leu Arg Ile Ile Phe Ala Ile Glu Ser Thr Phe Val Gly Ser
                245                 250                 255

Trp Leu Pro Phe Ser Ala Leu Arg Ala Val Phe His Leu Ala Arg Leu
            260                 265                 270

Gly Ala Leu Pro Leu Pro Cys Pro Leu Leu Leu Ala Leu Arg Trp Gly
        275                 280                 285

Leu Thr Ile Ala Thr Cys Leu Ala Phe Val Asn Ser Cys Ala Asn Pro
    290                 295                 300

Leu Ile Tyr Leu Leu Leu Asp Arg Ser Phe Arg Ala Arg Ala Leu Asp
305                 310                 315                 320

Gly Ala Cys Gly Arg Thr Gly Arg Leu Ala Arg Arg Ile Ser Ser Ala
                325                 330                 335

Ser Ser Leu Ser Arg Asp Asp Ser Ser Val Phe Arg Cys Arg Ala Gln
            340                 345                 350

Ala Ala Asn Thr Ala Ser Ala Ser Trp
        355                 360

What is claimed is: 

1. An isolated polynucleotide comprising a nucleotide
sequence encoding the polypeptide comprising the amino
acid sequence as set forth in SEQ ID NO:2.

2. The isolated polynucleotide of claim 1 comprising the
polynucleotide as set forth in SEQ ID NO:1.



3. The isolated polynucleotide of claim 1 comprising
nucleotides 79 to 1161 of the polynucleotide as set forth in
SEQ ID NO:1.

4. An isolated expression vector comprising a polynucle- 
otide encoding a polypeptide having the amino acid 
sequence as set forth in SEQ ID NO:2.

5. A process for producing a recombinant host cell comprising
transforming or transfecting a host cell with the
expression vector of claim 4 such that the host cell, under
appropriate culture conditions, produces a polypeptide comprising
the amino acid sequence as set forth in SEQ ID
NO:2.

6. A recombinant host cell produced by the process of 
claim 5. 

7. An isolated membrane of the recombinant host cell of 
claim 6 expressing a polypeptide comprising the amino acid 
sequence as set forth in SEQ ID NO: 2. 

8. A process for producing a polypeptide having the amino
acid sequence as set forth in SEQ ID NO:2 comprising
culturing the host cell of claim 6 under conditions sufficient
for the production of said polypeptide and recovering said
polypeptide from the culture.

9. An isolated polynucleotide which is fully complemen- 
tary to the nucleotide sequence encoding the amino acid 
sequence as set forth in SEQ ID NO:2. 

10. An isolated polynucleotide which is fully complementary
to the polynucleotide as set forth in SEQ ID NO:1.

11. An isolated polynucleotide which is fully complementary
to nucleotides 79 to 1161 of the polynucleotide as set
forth in SEQ ID NO:1.